
March 2018

03/31 Sat

Stay humble. Be real.

03/30 Fri

Our scheduling, or IB, does have issues. I must revisit this.

The case is: in P, we boot only 12 cores, and three of them are used by flush, sweep, and IB, so 9 cores are left for user threads. Phoenix creates 24 threads. During the run, a lot of IB timeouts happen. With good scheduling, this should never happen. I probably need to check more on this.

Anyway. Today I reorganized the opcode things. And now I’m adding the final large piece of Lego: replication. It should be much simpler than the pcache part. I will first write down what code I need to add, e.g., opcode, handler, buffer management, etc.

End of day. Want to write down some simple thoughts on building systems. Building systems is fun, but you have to know that the devil is in the details. And you may end up debugging for many, many hours on a very, very little issue. But that is how it is. Building systems does not mean you are always working on fantastic, beautiful ideas. It is always about those little bugs, little things, trivial fixes, that make your system robust and usable. For example, the patch Yilun sent me today is about handling special cases of stat and lseek. The patch does not improve any performance or add fancy features; it is a minor fix to make a user program run. But this enables us to run TF. I think it is a great patch, and it stands for 90% of building systems in the middle or late stage.

Of course, there are other trivial things in building systems:

  1. Initialize every variable that may be used: local variables, malloc’ed buffers.
  2. Have decent cleanup, the counterpart of your initialization: dequeue from lists, decrease counters, etc.
  3. Clear coding style. Write code for others, and for yourself when you read the code two weeks later. This one is hard and needs experience, but it can be learned. I think Yilun and Yutong both improved a lot during this project. Me? I learned this from the NVM emulator project. It was a painful one, but also a valuable one.
  4. Decent project source file organization.
  5. Remember, and draw, the connections between subsystems. By adding a new feature to subsystem A, will it break subsystem B, which uses subsystem A? Something like this.
  6. A clear mind on lock usage and multithreading issues. This is the most difficult one. I would say I learned this by coding pcache, or mm. I would say mm is the most difficult multithreading problem one can encounter.

03/26 Mon

Spent several days on replication design. Now I’m back on the coding and debugging track.

Fixed a bug introduced by the per-pte lock, one previously hidden by the big giant page-table lock.

Also added an option to boot socket 0 only if Processor is configured. This is because pcache is normally registered at socket 0; if we schedule user threads to sockets other than socket 0, performance will be bad.

03/22 Thur

Clear Registers for execve()

Want to figure out the execve problem today.

  1. Check if pcache is clean after process_exit.
  2. Check if pgtable is clean.

Well. Checked, both are clean.

The bug looks like the return of main eventually does not go to the library’s exit. Is it because library pages are not loaded properly? Since the number of pgfaults equals the normal setting, I guess it may originate from the Memory side.

TLB is also flushed, so TLB should not be a hidden issue.

Going to check the checksum. Well, the checksum is okay too.

Syscall execve will change the IP, SP, and FLAGS registers, so it will use iretq instead of sysexit to return to userspace.

Got an interesting IB bug after execve. CPU5 seems to fail to return to userspace, and CPU0 hits the IB bug that follows:

[ 1201.940681] CPU: 5 PID: 32 Comm: seq.o 4.0.0-lego-ys+ #609
[ 1202.006200] RIP: 0033:[<0000000000401d1d>]  [<0000000000401d1d>] 0x401d1d
[ 1202.087320] RSP: 002b:00007fffffffedb0  EFLAGS: 00000200
[ 1202.150760] RAX: 0000000000000000 RBX: 00000000004002e0 RCX: 000000000043b2c7
[ 1202.236041] RDX: 00007fffffffedc8 RSI: 00007fffffffeb40 RDI: 000000000048f9f0
[ 1202.321320] RBP: 00007fffffffeb60 R08: 00000000006ba4a0 R09: 00000000006bc880
[ 1202.406601] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000000000
[ 1202.491881] R13: 0000000000401930 R14: 00000000004019c0 R15: 0000000000000006
[ 1202.577161] FS:  0000000000000000(0000) GS:ffff88207fc40000(0000) knlGS:0000000000000000
[ 1202.673880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1202.742521] CR2: 000000000042c9a0 CR3: 000000207fc2f000 CR4: 00000000000406a0

[ 1220.465601] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 1220.557225] IP: [<ffffffff810591ef>] ib_mad_completion_handler+0x6f/0x7c0
[ 1220.638344] PGD 0
[ 1220.662265] Oops: 0000 [#1] SMP PROCESSOR
[ 1220.710105] CPU: 0 PID: 27 Comm: ib_mad_completi 4.0.0-lego-ys+ #609
[ 1220.786025] RIP: 0010:[<ffffffff810591ef>]  [<ffffffff810591ef>] ib_mad_completion_handler+0x6f/0x7c0
[ 1220.896265] RSP: 0000:ffff88103eea7e30  EFLAGS: 00010246
[ 1220.959704] RAX: 0000000000000000 RBX: ffff88103eeac728 RCX: 0000000000000001
[ 1221.044985] RDX: 0000000028000000 RSI: ffff88103ee8f000 RDI: ffff88107ff841d8
[ 1221.130265] RBP: ffff88103eea7ec0 R08: 0000000000000000 R09: ffff88103eea03c0
[ 1221.215545] R10: ffff88103eea7ea0 R11: 0000000000000001 R12: ffff88103ee8c3f0
[ 1221.300825] R13: ffff88103ee8c4e8 R14: ffff88103eeac620 R15: ffff88103eeac5f8
[ 1221.386106] FS:  0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
[ 1221.482825] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1221.551466] CR2: 0000000000000020 CR3: 000000000113d000 CR4: 00000000000406b0
[ 1221.636746] Stack:
[ 1221.660666] ffff88103eeaac10 ffff881000000001 ffff88103eeaac10 ffff88103eeaab50
[ 1221.748026] ffff88107fc05d80 ffff88103eea0000 ffff88103eeac728 0000008000000000
[ 1221.835386] 000001283eea7ea8 ffff88103ee8c9a8 000000007fcf2000 ffff000000000000
[ 1221.922746] ffff88107fcf0000 ffff88207ff6cbd8 ffff88107fcf76e8 ffff88103ee8c3f0
[ 1222.010106] ffffffff81059180 0000000000000000 ffff88103eea7f48 ffffffff81020866
[ 1222.097466] Call Trace:
[ 1222.126586] <TSK>
[ 1222.149466] [<ffffffff81059180>] ? ib_mad_send_done_handler.isra.21+0x1d0/0x1d0
[ 1222.236826] [<ffffffff81020866>] kthread+0xf6/0x120
[ 1222.295066] [<ffffffff81020770>] ? __kthread_parkme+0x70/0x70
[ 1222.363707] [<ffffffff8100e4b2>] ret_from_fork+0x22/0x30

[root@wuklab12: LegoOS git:(master)] $ addr2line -e vmImage  -i ffffffff810591ef
/root/ys/LegoOS/drivers/infiniband/core/mad.c:1899
/root/ys/LegoOS/drivers/infiniband/core/mad.c:2324

It is ib_mad_recv_done_handler()

Well…

Eventually, at 22:09, I figured it out.

After I cleaned up all registers (except IP, SP, CS, SS, FLAGS) within start_thread, the execve’ed program can run to the end successfully.

I did not clear the registers because Linux does not clear them. I thought this was fine. Glibc should clear them anyway, right?

But anyway, this works.

03/21 Wed

  • Task 1: add some checking in the ib, flush, and sweep threads: 1) whether the CPU changed, 2) whether nr_threads on this core > 1.

Had an issue while testing execve(). I ran exec.o first, then called execve to run seq.o:

wuklab13 0321-10
[  970.380252] STDOUT: ---[
uname():

---
[  970.431212] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x44605d flags:0x150
[ 1101.862429] mlx4_ib_handle_error_cqe syndrome 21
[ 1101.915570] mlx4_ib_handle_error_cqe syndrome 5
[ 1101.969649] send request failed at connection 4 as 12
[ 1102.029968] mlx4_ib_handle_error_cqe syndrome 5
[ 1102.084046] mlx4_ib_handle_error_cqe syndrome 5
[ 1102.138125] mlx4_ib_handle_error_cqe syndrome 5
[ 1102.192203] fit_poll_cq: failed status (5) for wr_id 1054
[ 1102.256681] fit_poll_cq: failed status (5) for wr_id 1055
[ 1102.321160] csum: 442a97c0, reply->csum: 2d352c33
[ 1102.377319] fit_poll_cq: connection 4 Recv weird event as -1
[ 1102.444916] pcache:ffff880180011180 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880100446000
[ 1102.558273] fit_poll_cq: failed status (5) for wr_id 1056
[ 1102.622751] pcache dumped because: csum mismatch
[ 1102.677871] fit_poll_cq: connection 4 Recv weird event as -30704
[ 1102.749627] ------------[ cut here ]------------
[ 1102.804746] fit_poll_cq: failed status (5) for wr_id 1057
[ 1102.869225] BUG: failure at managers/processor/pcache/fault.c:237/__pcache_do_fill_page()!
[ 1102.968022] fit_poll_cq: connection 4 Recv weird event as -30704
[ 1103.039780] Kernel Panic - not syncing: BUG!
[ 1103.090739] CPU: 5 PID: 32 Comm: seq.o 4.0.0-lego-ys+ #599
[ 1103.156256] Stack:
[ 1103.180177] ffff88103e85be18 ffffffff8102676c ffffffff00000008 ffff88103e85be28
[ 1103.267533] ffff88103e85bde0 0000000021475542 0000000000000296 ffff88103e85ba10
[ 1103.354892] ffffffff810195c5 ffff88207fc44980 0000000000000005 ffff88103e85ba28
[ 1103.442249] ffffffff81016c75 ffff88103e85ba40 ffff88103e85ba50 ffffffff810065d4
[ 1103.529607] ffffffff811d36e0 0000000000000039 ffffffff81081718 ffff88103e85bb80
[ 1103.616964] Call Trace:

03/20 Tue

  • Task 1: calculate failure numbers
  • Task 2: read 0319-4 Log
  • Task 3: opt pte lock

Hmm, I finished the per-pte, per-pmd lock patch. I think it works. But I did find an issue. When I run MT+2GB, it creates 24 threads. Since I marked 3 CPUs inactive, all 24 new threads will be scheduled to the other cores (I may need to check this!). At some point, Lego P either gets stuck, or a lot of ibapi_send_reply timeouts happen.

When I change cpu_online to 0-6, it finishes. When I change it to 0-18, it also succeeds. I really doubt whether those pinned threads are actually pinned. Need to check.

IB Bug again

Running MT-phoenix, 2GB; it somehow crashed in the middle:

[60095.857381] SYSC_close() CPU6 PID:33 [fd: 4] -> [/proc/stat]

[60286.127359] mlx4_ib_handle_error_cqe syndrome 21
[60286.180503] mlx4_ib_handle_error_cqe syndrome 5
[60286.234582] send request failed at connection 4 as 12
[60286.294903] mlx4_ib_handle_error_cqe syndrome 5
[60286.348981] mlx4_ib_handle_error_cqe syndrome 5
[60286.403062] mlx4_ib_handle_error_cqe syndrome 5
[60286.457141] mlx4_ib_handle_error_cqe syndrome 5
[60286.511221] send request failed at connection 4 as 5
[60286.570500] fit_poll_cq: failed status (5) for wr_id 1056
[60286.634980] mlx4_ib_handle_error_cqe syndrome 5
[60286.689059] fit_poll_cq: failed status (5) for wr_id 1057
[60286.753539] send request failed at connection 4 as 5
[60286.812819] fit_poll_cq: failed status (5) for wr_id 1058
[60286.877298] mlx4_ib_handle_error_cqe syndrome 5
[60286.931378] fit_poll_cq: failed status (5) for wr_id 1059
[60286.995857] send request failed at connection 4 as 5
[60287.055138] mlx4_ib_handle_error_cqe syndrome 5
[60287.109217] mlx4_ib_handle_error_cqe syndrome 5
[60287.163297] mlx4_ib_handle_error_cqe syndrome 5
[60287.217376] mlx4_ib_handle_error_cqe syndrome 5
[60287.271456] mlx4_ib_handle_error_cqe syndrome 5
[60287.325536] send request failed at connection 4 as 5
[60287.384815] fit_poll_cq: failed status (5) for wr_id 1060
[60287.449294] mlx4_ib_handle_error_cqe syndrome 5
[60287.503375] BUG: unable to handle kernel NULL pointer dereference at           (null)
[60287.596973] IP: [<ffffffff81063ffd>] fit_poll_cq+0x4dd/0x530
[60287.664574] send request failed at connection 4 as 5
[60287.723853] PGD 0
[60287.747772] mlx4_ib_handle_error_cqe syndrome 5
[60287.801852] Oops: 0002 [#1] PREEMPT SMP PROCESSOR
[60287.858013] send request failed at connection 4 as 5
[60287.917292] CPU: 2 PID: 29 Comm: recvpollcq 4.0.0-lego-ys+ #569
[60287.988010] RIP: 0010:[<ffffffff81063ffd>]  [<ffffffff81063ffd>] fit_poll_cq+0x4dd/0x530
[60288.084731] RSP: 0000:ffff88103e84fd88  EFLAGS: 00010206
[60288.148170] RAX: 0000000000001008 RBX: ffff88103e848438 RCX: 0000000000000014
[60288.233450] RDX: 0000000000000000 RSI: ffffffff811d36e0 RDI: ffffffff811dac08
[60288.318728] RBP: ffff88103e84fea8 R08: 0000000000000000 R09: 0000000000000000
[60288.404008] R10: 0000000000000002 R11: 0000000000000004 R12: 0000000000000000
[60288.489288] R13: ffff88207fd6e008 R14: 0000000000000004 R15: ffff88103e84fda0
[60288.574568] mlx4_ib_handle_error_cqe syndrome 5
[60288.628647] FS:  0000000000000000(0000) GS:ffff88107fc20000(0000) knlGS:0000000000000000
[60288.725367] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60288.794006] CR2: 0000000000000000 CR3: 000000000113d000 CR4: 00000000000406a0
[60288.879285] send request failed at connection 4 as 5
[60288.938565] Stack:
[60288.962484] ffffffff810031d9 000801d43e84fda0 0000000000000007 0000000000000424
[60289.049844] 0000008100000005 00001008000000f9 ffff88103e848868 00616e6440000014
[60289.137204] 0020004000000002 ffff88207fc00000 0000000000000425 0000008100000005
[60289.224563] 00001008000000f9 ffff88103e848868 007370654000000d 0010004000000002
[60289.311922] ffffffff81010000 0000000000000426 0000008100000005 00001008000000f9
[60289.399282] Call Trace:
[60289.428402] mlx4_ib_handle_error_cqe syndrome 5
[60289.482482] <TSK>
[60289.505361] [<ffffffff810031d9>] ? native_smp_send_reschedule+0x39/0x50
[60289.584400] send request failed at connection 4 as 5
[60289.643680] [<ffffffff81010000>] ? __ioremap_caller+0x170/0x570
[60289.714400] [<ffffffff81060000>] ? cm_work_handler+0x270/0x1450
[60289.785119] [<ffffffff81064050>] ? fit_poll_cq+0x530/0x530
[60289.850639] [<ffffffff81064064>] fit_poll_cq_pass+0x14/0x30
[60289.917198] [<ffffffff81020c06>] kthread+0xf6/0x120
[60289.975438] mlx4_ib_handle_error_cqe syndrome 5
[60290.029518] [<ffffffff81020b10>] ? __kthread_parkme+0x70/0x70
[60290.098157] [<ffffffff8100e722>] ret_from_fork+0x22/0x30

Uuh:

[ 1002.803051] mlx4_ib_handle_error_cqe syndrome 1
[ 1002.855153] mlx4_ib_handle_error_cqe syndrome 5
[ 1002.909232] mlx4_ib_handle_error_cqe syndrome 5
[ 1002.963310] mlx4_ib_handle_error_cqe syndrome 5
[ 1003.017390] fit_poll_cq: failed status (1) for wr_id 512
[ 1003.080829] BUG: unable to handle kernel NULL pointer dereference at 0000000000000200
[ 1003.174425] IP: [<ffffffff8105d499>] fit_poll_cq+0x179/0x510
[ 1003.242024] PGD 0
[ 1003.265943] Oops: 0000 [#1] SMP MEMORY
[ 1003.310661] CPU: 2 PID: 29 Comm: recvpollcq 4.0.0-lego-ys+ #149
[ 1003.381380] RIP: 0010:[<ffffffff8105d499>]  [<ffffffff8105d499>] fit_poll_cq+0x179/0x510
[ 1003.478098] RSP: 0000:ffff88104e84fd88  EFLAGS: 00010246
[ 1003.541537] RAX: ffff880000000000 RBX: ffff88104e848008 RCX: 0000000000000080
[ 1003.626814] RDX: 0000000000000200 RSI: ffffffff811c76e0 RDI: ffffffff811d0988
[ 1003.712092] RBP: ffff88104e84fea8 R08: 0000000000000000 R09: 0000000000000000
[ 1003.797369] R10: 0000000000000002 R11: 0000000000000004 R12: 0000000000000000
[ 1003.882648] R13: ffff88207ff75008 R14: 0000000000000004 R15: ffff88104e84fda0
[ 1003.967925] FS:  0000000000000000(0000) GS:ffff88107fc20000(0000) knlGS:0000000000000000
[ 1004.064644] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1004.133282] CR2: 0000000000000200 CR3: 0000000001131000 CR4: 00000000000406a0
[ 1004.218559] Stack:
[ 1004.242479] ffffffff810031a9 ffff88104e84fda0 ffffffff81018ef4 0000000000000200
[ 1004.329837] 0000008000000001 00000048000000d7 ffff88104e848c98 0000000081019302
[ 1004.417194] 0014000000000000 ffff88207fc00000 0000000000000201 ffffffff00000005
[ 1004.504552] ffff8810000000f9 ffff88104e848c98 0000000000000000 ffff88104e84fe38
[ 1004.591910] ffffffff810195a4 0000000000000202 ffff881000000005 ffff8810000000f9
[ 1004.679268] Call Trace:
[ 1004.708388] <TSK>
[ 1004.731267] [<ffffffff810031a9>] ? native_smp_send_reschedule+0x39/0x50
[ 1004.810305] [<ffffffff81018ef4>] ? resched_curr+0x34/0x40
[ 1004.874783] [<ffffffff810195a4>] ? try_to_wake_up+0xe4/0x1f0
[ 1004.942382] [<ffffffff8105f458>] ? __schedule+0xf8/0x1e0
[ 1005.005820] [<ffffffff8105d830>] ? fit_poll_cq+0x510/0x510
[ 1005.071338] [<ffffffff8105d844>] fit_poll_cq_pass+0x14/0x30
[ 1005.137897] [<ffffffff8101fdc6>] kthread+0xf6/0x120
[ 1005.196135] [<ffffffff8101fcd0>] ? __kthread_parkme+0x70/0x70
[ 1005.264773] [<ffffffff8100e472>] ret_from_fork+0x22/0x30

03/19 Mon

Not too many days left!!! Got to design the full replication mechanism and algorithm today.

Merged the pull request for pipe, pipe2, and /dev/null from Yilun. Our simple file op mgmt concerns me. I left a note at kernel/fork.c.

Got a bug report from Yilun: syscall execv failed. To be honest, I’ve never tried this syscall; I always call it directly within the kernel.

[  943.650712] CPU6 PID17 sys_execve+0x0/0x10
[  943.701899] BUG: unable to handle kernel paging request at 0000000000490523
[  943.702776] IP: [<ffffffff8103db86>] strrchr+0x6/0x20
[  943.711501] PGD 0
[  943.711911] Oops: 0000 [#1] SMP PROCESSOR
[  943.712433] CPU: 6 PID: 17 Comm: word_count-pthr 4.0.0-lego+ #64
[  943.713126] RIP: 0010:[<ffffffff8103db86>]  [<ffffffff8103db86>] strrchr+0x6/0x20
[  943.714090] RSP: 0018:ffff88083e4bfe98  EFLAGS: 00010246
[  943.714724] RAX: 0000000000000000 RBX: ffff88083e4b3780 RCX: 0000000000000000
[  943.715511] RDX: 00000000ffffffff RSI: 000000000000002f RDI: 0000000000490523
[  943.716297] RBP: ffff88083e4bfe98 R08: 0000160000000000 R09: ffff88083e4b8400
[  943.717085] R10: ffff880000000000 R11: 6db6db6db6db6db7 R12: ffff88083e4b8000
[  943.717871] R13: ffff88083e4e6290 R14: 0000000000490523 R15: ffff88083e4b3920
[  943.718683] FS:  0000000000000000(0000) GS:ffff88083fd80000(0000) knlGS:0000000000000000
[  943.719650] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  943.720319] CR2: 0000000000490523 CR3: 000000083e4e7000 CR4: 00000000000406a0
[  943.721106] Stack:
[  943.721459] ffff88083e4bff18 ffffffff8102c6bf ffff880800000000 0000000000000e10
[  943.722541] 00007fffffffedb0 0000000000400d0d ffff88083e4c0000 0000000000490523
[  943.723624] ffff88083e4b9008 00007fffffffed30 0000008400000084 ffff88083e4bff58
[  943.724706] 000000000000003b 00000000004019d0 0000000000401a60 0000000000000000
[  943.725789] ffff88083e4bff28 ffffffff8102c989 ffff88083e4bff48 ffffffff8100e5f5
[  943.726870] Call Trace:
[  943.727260] <TSK>
[  943.727619] [<ffffffff8102c6bf>] do_execve+0x4af/0x770
[  943.728236] [<ffffffff8102c989>] sys_execve+0x9/0x10
[  943.728868] [<ffffffff8100e5f5>] do_syscall_64+0x45/0xd0
[  943.729499] [<ffffffff8100d4ec>] entry_SYSCALL64_slow_path+0x25/0x25
[  943.730222] <EOT>
[  943.730570] Code: d2 74 18 40 38 f2 89 f1 75 06 eb 0f 38 ca 74 0b 48 83 c0 01 0f b6 10 84 d2 75 f1 5d c3 0f 1f 84 00 00 00 00 00 55 31 c0 48 89 e5 <0f> b6 17 40 38 f2 48 0f 44 c7 48 83 c7 01 84 d2 75 ee 5d c3 66
[  943.735455] RIP  [<ffffffff8103db86>] strrchr+0x6/0x20
[  943.736120]  RSP <ffff88083e4bfe98>
[  943.736598] CR2: 0000000000490523

It is setup_new_exec() -> set_task_comm(). I passed a user pointer to set_task_comm(), when I should have passed a kernel pointer.

And I actually found we missed a function: do_close_on_exec(). I also added a note above it.

Random IB Bug

Another weird bug after patching the loader. Actually, I tried the same setting twice, and the second time it worked. I guess this is some random IB bug. (Setting: 1P, 1M, 1S. Running a simple exec.c test program, which had not reached that point yet.)

wuklab13 0319-2
[  496.288272] p2m_fork(cpu0): I cur:1-kernel_init new:31
[  496.349624] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[  496.443216] IP: [<ffffffff81064935>] fit_send_reply_with_rdma_write_with_imm+0x65/0x3b0
[  496.538892] PGD 0
[  496.562811] Oops: 0002 [#1] PREEMPT SMP PROCESSOR
[  496.618968] CPU: 0 PID: 1 Comm: kernel_init 4.0.0-lego-ys+ #559
[  496.689684] RIP: 0010:[<ffffffff81064935>]  [<ffffffff81064935>] fit_send_reply_with_rdma_write_with_imm+0x65/0x3b0
[  496.814478] RSP: 0000:ffff88107fcf7d00  EFLAGS: 00010202
[  496.877915] RAX: 000000000000004c RBX: 0000000000000004 RCX: 000000000000002c
[  496.963190] RDX: 0000000000000004 RSI: 0000000000000001 RDI: ffff88207ff6d008
[  497.048466] RBP: ffff88107fcf7d98 R08: ffff88107fcf7e3c R09: 0000000000000004
[  497.133742] R10: ffffffff81145fe0 R11: 000000000000001c R12: 000000000000002c
[  497.219018] R13: 0000000000000001 R14: ffff88107fcf7e40 R15: ffff88207ff6d008
[  497.304293] FS:  0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
[  497.401009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  497.469645] CR2: 0000000000000004 CR3: 000000000113d000 CR4: 00000000000406b0
[  497.554922] Stack:
[  497.578840] ffff88107fcf7d08 0000000000000000 0000000000000282 ffffffff81077b10
[  497.666195] 000000000000003a 000000047fcf7e18 ffff88107fcf7e3c ffff88107fd5ed88
[  497.753552] 000000010000002c ffffff9b00000040 0000000000000034 ffffffff81145fe0
[  497.840906] ffff88107fcf7db0 0000000000000297 ffff88107fd5ed88 000000000000002c
[  497.928263] ffff88107fcf7e3c ffff88107fcf7e40 0000000000000039 ffff88107fcf7dc8
[  498.015618] Call Trace:
[  498.044736] <TSK>
[  498.067615] [<ffffffff810622ff>] ibapi_send_reply_timeout+0x3f/0x50
[  498.142492] [<ffffffff8103b0d4>] ? net_send_reply_timeout+0x94/0x132
[  498.218408] [<ffffffff8103b0d4>] net_send_reply_timeout+0x94/0x132
[  498.292244] [<ffffffff8102c683>] p2m_fork+0xd3/0x200
[  498.351521] [<ffffffff8101f490>] do_fork+0xf0/0x150
[  498.409758] [<ffffffff8101f514>] kernel_thread+0x24/0x30
[  498.473195] [<ffffffff8115bf21>] processor_manager_init+0x21/0x50
[  498.545991] [<ffffffff81000354>] kernel_init+0x94/0x120
[  498.608388] [<ffffffff810002c0>] ? 0xffffffff810002c0
[  498.668706] [<ffffffff81019b0a>] ? schedule_tail+0xa/0x40
[  498.733182] [<ffffffff810002c0>] ? 0xffffffff810002c0
[  498.793499] [<ffffffff8100e762>] ret_from_fork+0x22/0x30
[  498.856936] <EOT>

03/18 Sun

Got a bug report after enabling the preempt and sweep threads:

[  582.545444] pcache:ffff8801812cb680 mapcount:1 refcount:2 flags:(locked|allocated|usable|valid|reclaim) kva: ffff88014b2da000
[  582.678677] pcache dumped because: PCACHE_BUG_ON_PCM(pcache_mapped(pcm))
[  582.758760] rmap:ffff88207e5e37e8 flags:0x0 owner-tgid:33 user_va:0x7fff0b2da000 ptep:ffff88207e4a86d0
[  582.870046] pte:ffff88207e4a86d0 pfn:0x0 flags:()
[  582.926210] ------------[ cut here ]------------
[  582.981333] BUG: failure at managers/processor/pcache/victim.c:604/victim_finish_insert()!
[  583.080137] Kernel Panic - not syncing: BUG!
...
...
[  588.847239] nr_pgfault: 591101
[  588.883641] nr_clflush: 66176
[  588.919003] nr_pgfault_wp: 0
[  588.953325] nr_pgfault_wp_cow: 0
[  588.991806] nr_pgfault_wp_reuse: 0
[  589.032368] nr_pgfault_due_to_concurrent_eviction: 0
[  589.091651] nr_pcache_fill_from_memory: 587057
[  589.144694] nr_pcache_fill_from_victim: 4038
[  589.195656] nr_pcache_eviction_triggered: 439562
[  589.250780] nr_pcache_eviction_eagain_freeable: 373382
[  589.312143] nr_pcache_eviction_eagain_concurrent: 0
[  589.370386] nr_pcache_eviction_failure_find: 0
[  589.423429] nr_pcache_eviction_failure_evict: 0
[  589.477512] nr_pcache_eviction_succeed: 66176
[  589.529514] nr_victim_eviction_triggered: 733361
[  589.584638] nr_victim_eviction_eagain: 671227
[  589.636640] nr_victim_eviction_succeed: 62134
[  589.688642] nr_victim_prepare_insert: 66180
[  589.738566] nr_victim_finish_insert: 66176
[  589.787447] nr_victim_flush_submitted: 66176
[  589.838411] nr_victim_flush_finished: 66176
[  589.888332] nr_victim_flush_async_run: 0
[  589.935135] nr_victim_flush_sync: 0
[  589.976738] nr_sweep_run: 50580
[  590.014179] nr_sweep_nr_pset: 116770383
[  590.059943] nr_sweep_nr_moved_pcm: 100686435

This is an interesting bug. Two threads: one doing munmap or mremap, one doing eviction. They use the same pcm. munmap and mremap use pte_get_and_clear() to get the pcm, while eviction calls pcache_try_to_unmap(), which further calls rmap_get_locked_pte(), in which we check whether the pte is none; if it is, we assume it is under munmap or mremap, and we skip it. This is absolutely wrong. When pcache_try_to_unmap() is called by eviction, it should always unmap ALL rmaps. The above case is triggered because both threads skip the final __pcache_remove_rmap.

Hmm, looks like open/close filename is wrong. I need to check.

Last Log from MT+2GB, computation finished:

wuklab13 0318-10
[  627.280016]
****    ERROR:
***     current: 32:kevict_sweepd caller:           (null)
****    [pte == rmap->page_table] && [pcache_pfn != pte_pfn]
****    rmap->owner_process: word_count-pthr uva: 0x7fff78f52000 ptep: ffff88107e87fa90, rmap->page_table: ffff88107e87fa90
****    pcache_pfn: 0x168f52, pte_pfn: 0x178f52

[  627.624239] rmap:ffff88107dc73740 flags:0x0 owner-tgid:33 user_va:0x7fff78f52000 ptep:ffff88107e87fa90
[  627.735513] pte:ffff88107e87fa90 pfn:0x0 flags:()
[  627.791670] pcache_rmap dumped because: Corrupted RMAP
[  627.853026] pcache:ffff880181a3d480 mapcount:1 refcount:2 flags:(locked|allocated|usable|valid) kva: ffff880168f52000
[  627.979901] pcache dumped because: Corrupted RMAP
[  628.036057] ------------[ cut here ]------------
[  628.091175] BUG: failure at managers/processor/pcache/rmap.c:109/report_bad_rmap()!
[  628.182691] Kernel Panic - not syncing: BUG!
[  628.233647] CPU: 5 PID: 32 Comm: kevict_sweepd 4.0.0-lego-ys+ #543
[  628.307483] Stack:
[  628.331401] ffff88107e85bd00 ffffffff81026d24 0000000000000008 ffff88107e85bd10
[  628.418756] ffff88107e85bcc8 0000000021475542 0000000000000000 0000000000000000
[  628.506113] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  628.593468] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  628.680823] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  628.768179] Call Trace:
[  628.797299] <TSK>
[  628.820176] [<ffffffff81026d30>] panic+0xc2/0x102
[  628.876334] [<ffffffff8101c6ac>] ? task_tick_rt+0x2c/0xd0
[  628.940811] [<ffffffff8101c6ac>] ? task_tick_rt+0x2c/0xd0
[  629.005288] [<ffffffff81019bfc>] ? scheduler_tick+0x5c/0x70
[  629.071843] [<ffffffff81017195>] ? tick_handle_periodic+0x45/0x70
[  629.144639] [<ffffffff81006704>] ? apic_timer_interrupt+0x54/0x90
[  629.217436] [<ffffffff8100e4da>] ? smp__apic_timer_interrupt+0x6a/0x70
[  629.295432] [<ffffffff81012d94>] ? printk+0x124/0x1c0
[  629.355748] [<ffffffff8103ad1f>] report_bad_rmap+0x144/0x144
[  629.423345] [<ffffffff81031046>] pcache_referenced_trylock_one+0x1c6/0x2c0
[  629.505500] [<ffffffff8100e4da>] ? smp__apic_timer_interrupt+0x6a/0x70
[  629.583497] [<ffffffff810328a1>] rmap_walk+0x71/0xe0
[  629.642774] [<ffffffff81033329>] pcache_referenced_trylock+0x59/0xd0

03/17 Sat

I’m too tired today.

Coding side, I will only optimize sweep. Besides, I will book tickets for Iceland trip.

03/16 Friday

Task 1: Add physical memory counters. They are per-zone counters, even though there are also some global counters. In Linux, the per-cpu counter is accumulated first, and the global counter is updated only when the per-cpu ones overflow. Lego’s initial version saves the trouble of per-cpu counters; I only ported one global counter today, because I’m not quite confident about our percpu_alloc…

Anyway, the info is reported in the format of manager_sysinfo. Do note this is different from the original sysinfo structure, which is used by the sysinfo syscall.

Task 2: Patch get_random_number and /dev/urandom, /dev/random. Someone else wrote the code, but he did not stick to the traditional naming format, so I have to rewrite some of it. Sigh.

Task 3: optimize sweep

03/15 Thur

Forgot to write the log yesterday. I actually solved the major bug, the refcount-and-eviction one. That one is really nasty. I basically used the pte lock, pcache lock, and refcount to synchronize between the eviction routine and other users such as munmap, mremap, and the write-protect handler.

I’m really not sure whether this mode could be reproduced if I built any other similar system. But I’m glad that I found a way to do this.

Today I have a few tasks going on: first merge the storage syscall branch, then add the sched_yield syscall, add zone/node counters, and probably patch get_random_number.

Task 1: Merge Yilun’s storage pull request, which has a bunch of syscalls. I’m reviewing it now.

  • truncate
  • ftruncate
  • getdents
  • getcwd
  • mkdir
  • rmdir
  • creat
  • unlink
  • unlinkat
  • readlink
  • statfs
  • sync

Task 2: Add sched_yield(). Fairly simple.

Task 3: Add physical memory counters. Fairly complex. The underlying infrastructure was built a long time ago; I need to pick it up again. Well, some facts:

  • pg_data_t (and zone) is allocated by alloc_node_data if NUMA is configured.
  • all zones are built and initialized in memory_init() in Lego
  • stats are reset to 0 when pg_data_t is allocated (DUH?). Handled directly in page_alloc.c

Have to continue tomorrow.

Task 4: Patch get_random_number and /dev/urandom

03/13 Wed

The slow victim flush issue is solved by pinning the thread to a core and removing that core from the active_cpu mask.

Today I’m going to solve the SMP object issue. I’m hoping that by solving this, we can have a completely working pcache and victim cache.

Continue yesterday’s log:

wuklab13 0313-12
[ 1073.616269] pcache:ffff880180777a80 mapcount:0 refcount:3 flags:(locked|allocated|usable) kva: ffff88011ddea000
[ 1073.734941] __clflush_one(): EFAULT:bad address tsk: 32 user_va: 0x7fff4ddea000
[ 1073.822304] pcache dumped because: evict/ref bug

[ 1073.987667] BUG: failure at managers/processor/pcache/evict.c:301/pcache_evict_line()!

[ 1074.082308] BUG: failure at managers/processor/pcache/rmap.c:763/pcache_zap_pte()!
[ 1074.172789] Kernel Panic - not syncing: BUG!
[ 1074.223751] CPU: 23 PID: 50 Comm: word_count-pthr 4.0.0-lego-ys+ #476

Time  CPU0: pcache_evict_line()    CPU1: zap_pte_range()
  1   find @pcm to evict           prepare to unmap pte which points to @pcm
  2   lock_pcache()                ..
  3   pcache_try_to_unmap()        pte_offset_lock()
  4   try to lock pte              pcache_zap_pte()
  5   ..spin..                     trylock_pcache (failed)
  6   ..spin..                     unlock pte
  7   lock pte                     trylock pcache
  8   clear pte                    ..spin..
  9   unlock pte                   ..spin..
 10   unlock pcache                ..spin..
 11   ..                           lock pcache
 12   ..                           lock pte
 13   ..                           HERE, should check if pte changed!

Huh, patched both eviction and the other code. Use refcount, pcache lock, and pte lock to synchronize between all users. Make sure a going-to-be-evicted pcm will not be used by others, and that others will not have a chance to use such a line.

03/12 Tue

Continuing on the victim cache. The current conclusion is that the victim cache has unbalanced input and output rates. That is why some cores time out and abort.

Got some cleaner logs. The log tells us that flushd_victim is too slow at flushing content. Next I am going to print the current flush queue content to make sure those entries are really not flushed. If so, I want to add code to flush synchronously.

[  318.193591] CPU4 PID:54 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207f81d340, pset_idx:1869, nr_lru:7
[  318.330986]   --   Start Dump Victim Cache     --
[  318.388190]   --   CPU4 [word_count-pthr][pid=54, tgid=32] --
[  318.456835] victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d200
[  318.627406]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90748000
[  318.707492]     pset:ffff88207f81d200 set_idx: 1864 nr_lru:8
[  318.775096]
[  318.792778] victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d240
[  318.963349]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90749000
[  319.043435]     pset:ffff88207f81d240 set_idx: 1865 nr_lru:8
[  319.111040]
[  319.128721] victim:ffff88207ff71090 index:2 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d180
[  319.299292]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90746000
[  319.379378]     pset:ffff88207f81d180 set_idx: 1862 nr_lru:8
[  319.446983]
[  319.464664] victim:ffff88207ff710d8 index:3 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d280
[  319.635237]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074a000
[  319.715321]     pset:ffff88207f81d280 set_idx: 1866 nr_lru:8
[  319.782927]
[  319.800608] victim:ffff88207ff71120 index:4 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d140
[  319.971179]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90745000
[  320.051265]     pset:ffff88207f81d140 set_idx: 1861 nr_lru:8
[  320.118870]
[  320.136551] victim:ffff88207ff71168 index:5 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d300
[  320.307123]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074c000
[  320.387208]     pset:ffff88207f81d300 set_idx: 1868 nr_lru:8
[  320.454813]
[  320.472494] victim:ffff88207ff711b0 index:6 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d1c0
[  320.643066]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90747000
[  320.723152]     pset:ffff88207f81d1c0 set_idx: 1863 nr_lru:8
[  320.790756]
[  320.808438] victim:ffff88207ff711f8 index:7 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d2c0
[  320.979009]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074b000
[  321.059096]     pset:ffff88207f81d2c0 set_idx: 1867 nr_lru:8
[  321.126700]
[  321.144381]   --   End Dump Victim Cache       --
[  321.200545] CPU4 PID:54 fail to allocate pcache or victim cache lines.
[  321.278552] word_count-pthr[54]: segfault at 0x74d000 ip 000000000040249d sp 00007fff7674cd80 error 6
[  321.511925] nr_pgfault: 551908
[  321.546357] nr_clflush: 33449
[  321.581718] nr_pgfault_wp: 0
[  321.616040] nr_pgfault_wp_cow: 0
[  321.654523] nr_pgfault_wp_reuse: 0
[  321.695087] nr_pgfault_due_to_concurrent_eviction: 0
[  321.754371] nr_pcache_fill_from_memory: 546067
[  321.807414] nr_pcache_fill_from_victim: 5750
[  321.858378] nr_pcache_eviction_triggered: 38689
[  321.912461] nr_pcache_eviction_eagain: 5239
[  321.962385] nr_pcache_eviction_succeed: 33449
[  322.014389] nr_victim_eviction_triggered: 41887455
[  322.071592] nr_victim_eviction_eagain: 41859764
[  322.125676] nr_victim_eviction_succeed: 27691
[  322.177680] nr_victim_prepare_insert: 33450
[  322.227603] nr_victim_finish_insert: 33449
[  322.276487] nr_victim_flush_submitted: 33449
[  322.327451] nr_victim_flush_finished: 33449
[  322.377374] nr_victim_flush_async_run: 26989
[  322.428338] nr_victim_flush_sync: 0

Yes, these victims are truly not being flushed; they are sitting inside the flush_queue. No bug, whew! Just some performance coding issues. But why on earth does the flushd not get a chance to run within 10 seconds? Hmm…

[ 5520.236187] __clflush_one(): EFAULT:bad address tsk: 32 user_va: 0x7fff464fa000
[ 5530.404269] CPU4 PID:54 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207f81d340, pset_idx:1869, nr_lru:7
[ 5530.541664] CPU4 PID54   --   Start Dump Victim Cache [0]
[ 5530.606147] CPU4 PID54  victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d1c0
[ 5530.789194] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90747000
[ 5530.880717] CPU4 PID54     rmap to pset:ffff88207f81d1c0 set_idx: 1863 nr_lru:8
[ 5530.968080] CPU4 PID54  victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d280
[ 5531.151128] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074a000
[ 5531.242652] CPU4 PID54     rmap to pset:ffff88207f81d280 set_idx: 1866 nr_lru:8
[ 5531.330015] CPU4 PID54  victim:ffff88207ff71090 index:2 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d300
[ 5531.513063] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074c000
[ 5531.604586] CPU4 PID54     rmap to pset:ffff88207f81d300 set_idx: 1868 nr_lru:8
[ 5531.691950] CPU4 PID54  victim:ffff88207ff710d8 index:3 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d2c0
[ 5531.874997] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074b000
[ 5531.966521] CPU4 PID54     rmap to pset:ffff88207f81d2c0 set_idx: 1867 nr_lru:8
[ 5532.053885] CPU4 PID54  victim:ffff88207ff71120 index:4 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d200
[ 5532.236932] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90748000
[ 5532.328456] CPU4 PID54     rmap to pset:ffff88207f81d200 set_idx: 1864 nr_lru:8
[ 5532.415819] CPU4 PID54  victim:ffff88207ff71168 index:5 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d240
[ 5532.598867] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90749000
[ 5532.690390] CPU4 PID54     rmap to pset:ffff88207f81d240 set_idx: 1865 nr_lru:8
[ 5532.777753] CPU4 PID54  victim:ffff88207ff711b0 index:6 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d180
[ 5532.960802] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90746000
[ 5533.052325] CPU4 PID54     rmap to pset:ffff88207f81d180 set_idx: 1862 nr_lru:8
[ 5533.139689] CPU4 PID54  victim:ffff88207ff711f8 index:7 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d140
[ 5533.322736] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff90745000
[ 5533.414259] CPU4 PID54     rmap to pset:ffff88207f81d140 set_idx: 1861 nr_lru:8
[ 5533.501623] CPU4 PID54   --   End Dump Victim Cache [0]

[ 5533.566106] CPU4 PID54   --  Start Dump Victim Flush Queue [0]
[ 5533.635789] CPU4 PID54  victim:ffff88207ff711f8 index:7 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d140
[ 5533.818837] CPU4 PID54  victim:ffff88207ff711b0 index:6 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d180
[ 5534.001884] CPU4 PID54  victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d1c0
[ 5534.184931] CPU4 PID54  victim:ffff88207ff71120 index:4 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d200
[ 5534.367978] CPU4 PID54  victim:ffff88207ff71168 index:5 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d240
[ 5534.551025] CPU4 PID54  victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d280
[ 5534.734074] CPU4 PID54  victim:ffff88207ff710d8 index:3 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d2c0
[ 5534.917120] CPU4 PID54  victim:ffff88207ff71090 index:2 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d300
[ 5535.100168] CPU4 PID54   --  End Dump Victim Flush Queue [0]

[ 5535.169851] CPU4 PID:54 fail to allocate pcache or victim cache lines.
[ 5535.247854] word_count-pthr[54]: segfault at 0x74d000 ip 000000000040249d sp 00007fff7674cd80 error 6
[ 5535.480513] nr_pgfault: 549578
[ 5535.514943] nr_clflush: 31822
[ 5535.550304] nr_pgfault_wp: 0
[ 5535.584625] nr_pgfault_wp_cow: 0
[ 5535.623107] nr_pgfault_wp_reuse: 0
[ 5535.663669] nr_pgfault_due_to_concurrent_eviction: 0
[ 5535.722952] nr_pcache_fill_from_memory: 544279
[ 5535.775993] nr_pcache_fill_from_victim: 5201
[ 5535.826955] nr_pcache_eviction_triggered: 37437
[ 5535.881038] nr_pcache_eviction_eagain: 5614
[ 5535.930960] nr_pcache_eviction_succeed: 31822
[ 5535.982963] nr_victim_eviction_triggered: 42000029
[ 5536.040165] nr_victim_eviction_eagain: 41973416
[ 5536.094247] nr_victim_eviction_succeed: 26613
[ 5536.146249] nr_victim_prepare_insert: 31823
[ 5536.196171] nr_victim_finish_insert: 31822
[ 5536.245052] nr_victim_flush_submitted: 31822
[ 5536.296015] nr_victim_flush_finished: 31822
[ 5536.345937] nr_victim_flush_async_run: 26718
[ 5536.396899] nr_victim_flush_sync: 0

Hmm, got an interesting bug that never happened before. We did an unmap before finish_insert, so the mapcount must be zero, since we set Reclaim on the candidate. But it looks like other code does not care much about the Reclaim bit. I need to check.

[ 1009.676839] victim_flush_async CPU4 jobs 1
[ 1009.725830] victim_flush_async CPU4 jobs 1
[ 1009.774423] victim_flush_async CPU4 jobs 1
[ 1009.823147] __clflush_one(): EFAULT:bad address tsk: 32 user_va: 0x7fff465fc000
[ 1009.910479] pcache:ffff88018098d740 mapcount:1 refcount:3 flags:(locked|allocated|usable|valid|reclaim) kva: ffff88012635d000
[ 1010.045652] pcache dumped because: PCACHE_BUG_ON_PCM(pcache_mapped(pcm))
[ 1010.125725] victim_flush_async CPU4 jobs 1
[ 1010.174602] ------------[ cut here ]------------
[ 1010.229717] BUG: failure at managers/processor/pcache/victim.c:601/victim_finish_insert()!
[ 1010.328509] victim_flush_async CPU4 jobs 1
[ 1010.377385] Kernel Panic - not syncing: BUG!
[ 1010.428341] CPU: 20 PID: 47 Comm: word_count-pthr 4.0.0-lego-ys+ #468
[ 1010.505294] Stack:
[ 1010.529212] ffff881f2040fe08 ffffffff810259f4 0000000000000008 ffff881f2040fe18
[ 1010.616565] ffff881f2040fdd0 0000000021475542 0000000000000000 0000000000000000
[ 1010.703918] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1010.791270] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1010.878623] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1010.965976] Call Trace:
[ 1010.995095] <TSK>
[ 1011.017972] [<ffffffff81025a00>] panic+0xc2/0x102
[ 1011.074127] [<ffffffff81063a8a>] ? client_internal_poll_sendcq+0x2a/0x80
[ 1011.154202] [<ffffffff81063c2d>] ? client_send_message_with_rdma_write_with_imm_request+0x14d/0x360
[ 1011.262351] [<ffffffff8101bffc>] ? task_tick_rt+0x2c/0xd0
[ 1011.326827] [<ffffffff81019755>] ? scheduler_tick+0x55/0x60
[ 1011.393382] [<ffffffff81016e25>] ? tick_handle_periodic+0x45/0x70
[ 1011.466175] [<ffffffff810066e4>] ? apic_timer_interrupt+0x54/0x90
[ 1011.538969] [<ffffffff8100e4aa>] ? smp__apic_timer_interrupt+0x6a/0x70
[ 1011.616964] [<ffffffff81012cfd>] ? printk+0x11d/0x1b0
[ 1011.677279] [<ffffffff81032a19>] victim_finish_insert+0x89/0x230
[ 1011.749032] [<ffffffff81031a99>] pcache_evict_line+0x79/0x280
[ 1011.817667] [<ffffffff8102f00a>] pcache_alloc+0x23a/0x340
[ 1011.882141] [<ffffffff8102e4da>] common_do_fill_page+0x2a/0x1b0
[ 1011.952856] [<ffffffff8102e160>] ? pcache_meta_to_kva+0x30/0x30
[ 1012.023570] [<ffffffff8102e802>] pcache_handle_fault+0x1a2/0x660
[ 1012.095324] [<ffffffff810102b2>] do_page_fault+0xa2/0x1a0
[ 1012.159799] [<ffffffff8100dadf>] page_fault+0x1f/0x30

Interesting. A memory consistency issue? Actually, I'm not sure it is the v->flags = 0 problem. Other code uses atomic bit operations to manipulate this flag, while the reset is a plain store. I checked the list operations; all of them are protected by a spinlock, so the below should never happen in theory. I'm changing v->flags = 0 to smp_store_mb(v->flags, 0), which is an xchg on x86. Same for pcache.
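In C11 terms, the change is from a plain store to a full-barrier exchange. A user-space analogue of the two flag operations (this is not the kernel code itself; the names are invented):

```c
#include <stdatomic.h>

static _Atomic unsigned long flags;

/* Like set_bit(): a lock-prefixed read-modify-write on x86. */
static void set_bit_atomic(int nr)
{
	atomic_fetch_or(&flags, 1UL << nr);
}

/* Like smp_store_mb(v->flags, 0): an atomic exchange with full
 * (seq_cst) ordering, rather than a plain store that other CPUs
 * may observe late relative to surrounding accesses. Returns the
 * old value, as xchg does. */
static unsigned long reset_flags(void)
{
	return atomic_exchange(&flags, 0);
}
```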

[ 1773.814490] CPU17 PID44  victim:ffff88207ff71000 index:0 refcount:1 nr_fill:0 locked:0 flags:(allocated|usable) pcm:          (null) pset:          (null)
[ 1773.979705] CPU17 PID44     hit[0] owner: [word_count-pthr][32] addr: 0x7fff95b1c000
[ 1774.072260] CPU17 PID44     rmap to pset:ffff88207f96c700 set_idx: 23324 nr_lru:8
[ 1774.161694] CPU17 PID44     victim dumped because: PCACHE_BUG_ON_VICTIM(!VictimUsable(v))
[ 1774.259451] ------------[ cut here ]------------
[ 1774.314567] BUG: failure at managers/processor/pcache/victim.c:231/find_victim_to_evict()!
[ 1774.413363] Kernel Panic - not syncing: BUG!
[ 1774.464320] CPU: 17 PID: 44 Comm: word_count-pthr 4.0.0-lego-ys+ #47
...
[ 1781.363348] nr_pcache_fill_from_victim: 2

Did another run. I added an explicit wake_up_victim_flushd call for when victim eviction fails to find any line. But this run dies with an IB failure..

[ 2336.950087] CPU4 PID:54 Abort victim alloc (20010ms) nr_usable_victims: 8 req from pset:ffff88207f81d340, pset_idx:1869, nr_lru:7
[ 2337.087474] CPU4 PID54   --   Start Dump Victim Cache [0]
[ 2337.151955] CPU4 PID54  victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d280
[ 2337.334999] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074a000
[ 2337.426521] CPU4 PID54     rmap to pset:ffff88207f81d280 set_idx: 1866 nr_lru:8
[ 2337.513883] CPU4 PID54  victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d2c0
[ 2337.696927] CPU4 PID54     hit[0] owner: [word_count-pthr][32] addr: 0x7fff9074b000
[ 2337.788450] CPU4 PID54     rmap to pset:ffff88207f81d2c0 set_idx: 1867 nr_lru:8
...
...
[ 2340.111861] CPU4 PID54   --  Start Dump Victim Flush Queue [0]
[ 2340.181543] CPU4 PID54  victim:ffff88207ff71090 index:2 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d140
[ 2340.364587] CPU4 PID54  victim:ffff88207ff71120 index:4 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d180
[ 2340.547632] CPU4 PID54  victim:ffff88207ff711f8 index:7 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d1c0
[ 2340.730675] CPU4 PID54  victim:ffff88207ff71168 index:5 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d200
[ 2340.913720] CPU4 PID54  victim:ffff88207ff711b0 index:6 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d240
[ 2341.096763] CPU4 PID54  victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d280
[ 2341.279808] CPU4 PID54  victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d2c0
[ 2341.462851] CPU4 PID54  victim:ffff88207ff710d8 index:3 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata|waitflush) pcm:          (null) pset:ffff88207f81d300
[ 2341.645895] CPU4 PID54   --  End Dump Victim Flush Queue [0]

[ 2341.715577] CPU4 PID:54 fail to allocate pcache or victim cache lines.
[ 2341.793579] word_count-pthr[54]: segfault at 0x74d000 ip 000000000040249d sp 00007fff7674cd80 error 6
[ 2476.201442] mlx4_ib_handle_error_cqe syndrome 21
[ 2476.254590] mlx4_ib_handle_error_cqe syndrome 5
[ 2476.308670] send request failed at connection 4 as 12
[ 2476.368991] mlx4_ib_handle_error_cqe syndrome 5
[ 2476.423073] mlx4_ib_handle_error_cqe syndrome 5
[ 2476.477153] mlx4_ib_handle_error_cqe syndrome 5
[ 2476.531236] client_poll_cq: failed status (5) for wr_id 1051
[ 2476.598837] client_poll_cq: failed status (5) for wr_id 1052
[ 2476.666438] __clflush_one(): EPERM:Operation not permitted tsk: 32 user_va: 0x7fff90745000
[ 2476.765240] client_poll_cq: connection 4 Recv weird event as -30704
[ 2476.840122] client_poll_cq: failed status (5) for wr_id 1053
[ 2476.907724] client_poll_cq: connection 4 Recv weird event as -30704
[ 2476.982605] client_poll_cq: failed status (5) for wr_id 1054
[ 2477.050207] client_poll_cq: connection 4 Recv weird event as -30704
[ 2477.125089] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.179169] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.233251] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.287332] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.341414] client_poll_cq: failed status (5) for wr_id 1055
[ 2477.409016] client_poll_cq: failed status (5) for wr_id 1056
..
..
[ 2477.761583] client_poll_cq: connection 4 Recv weird event as -30704
[ 2477.836464] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.890545] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.944626] mlx4_ib_handle_error_cqe syndrome 5
[ 2477.998707] mlx4_ib_handle_error_cqe syndrome 5
[ 2478.052789] client_poll_cq: failed status (5) for wr_id 1059
[ 2478.120392] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2478.213992] IP: [<ffffffff81064894>] client_poll_cq+0x1f4/0x6c0
[ 2478.284714] PGD 0
[ 2478.308635] Oops: 0002 [#1] SMP PROCESSOR
[ 2478.356476] CPU: 2 PID: 29 Comm: recvpollcq 4.0.0-lego-ys+ #473
[ 2478.427197] RIP: 0010:[<ffffffff81064894>]  [<ffffffff81064894>] client_poll_cq+0x1f4/0x6c0
[ 2478.527040] RSP: 0000:ffff88107e143d90  EFLAGS: 00010246
[ 2478.590481] RAX: 0000000000000000 RBX: ffff88207fc6e000 RCX: 0000000000000000
[ 2478.675762] RDX: 0000000000001008 RSI: ffffffff811d36e0 RDI: ffffffff811dab08
[ 2478.761044] RBP: ffff88107e143eb0 R08: 0000000000000000 R09: 0000000000000000
[ 2478.846327] R10: 0000000000000002 R11: 0000000000000004 R12: ffff88207fd4f000
[ 2478.931609] R13: 0000000000000004 R14: ffff88107e143da8 R15: 0000000000000000
[ 2479.016890] FS:  0000000000000000(0000) GS:ffff88107fc20000(0000) knlGS:0000000000000000
[ 2479.113613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2479.182254] CR2: 0000000000000000 CR3: 000000000113d000 CR4: 00000000000406a0
[ 2479.267536] Stack:
[ 2479.291457] ffff88107e143da0 0010129c81019794 0000000000000001 0000000000000423
[ 2479.378818] 0000008100000005 00001008000000f9 ffff88207fd39000 0000000040000000
[ 2479.466180] 000f004000000002 ffff88107e140000 0000000000000424 ffff881000000005
[ 2479.553542] 00000000000000f9 ffff88207fd39000 ffff88107e143e38 ffffffff81019e44
[ 2479.640904] 0000000000000001 0000000000000425 ffff881000000005 ffffffff000000f9
[ 2479.728266] Call Trace:
[ 2479.757386] <TSK>
[ 2479.780268] [<ffffffff81019e44>] ? try_to_wake_up+0xe4/0x1f0
[ 2479.847869] [<ffffffff81066d78>] ? __schedule+0xf8/0x1e0
[ 2479.911311] [<ffffffff81064d60>] ? client_poll_cq+0x6c0/0x6c0
[ 2479.979952] [<ffffffff81064d70>] client_poll_cq_pass+0x10/0x20
[ 2480.049634] [<ffffffff81020336>] kthread+0xf6/0x110
[ 2480.107875] [<ffffffff81020240>] ? __kthread_parkme+0x70/0x70
[ 2480.176516] [<ffffffff8100e732>] ret_from_fork+0x22/0x30

A classical SMP bug. Lucky to find this one. Let me try to describe it. There are two CPUs, CPU0 and CPU1. CPU0 is doing eviction while CPU1 is doing munmap->pcache_zap_pte. CPU0 selected a pcm, and this pcm happened to be zapped at the same time by CPU1. There are not enough safeguards to either 1) prevent CPU0 from selecting this pcm, or 2) prevent CPU1 from using this pcm. Either solution might work, but we need at least one.
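Option 2 could look roughly like the sketch below, with hypothetical names (PCM_RECLAIM, zap_may_use): the zap path checks the Reclaim bit that the eviction path sets when it selects a candidate, and skips the line if eviction already owns it.

```c
#define PCM_RECLAIM 0x1

struct pcm { unsigned flags; };

/* Called from the zap path with the pcache lock held, so the flag is
 * stable here. If eviction has marked this line Reclaim, the zapper
 * must not touch it; it backs off and retries later. */
static int zap_may_use(struct pcm *p)
{
	if (p->flags & PCM_RECLAIM)
		return 0;
	return 1;
}
```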

wuklab13 0313-12
[ 1073.616269] pcache:ffff880180777a80 mapcount:0 refcount:3 flags:(locked|allocated|usable) kva: ffff88011ddea000
[ 1073.734941] __clflush_one(): EFAULT:bad address tsk: 32 user_va: 0x7fff4ddea000
[ 1073.822304] pcache dumped because: evict/ref bug

[ 1073.987667] BUG: failure at managers/processor/pcache/evict.c:301/pcache_evict_line()!

[ 1074.082308] BUG: failure at managers/processor/pcache/rmap.c:763/pcache_zap_pte()!
[ 1074.172789] Kernel Panic - not syncing: BUG!
[ 1074.223751] CPU: 23 PID: 50 Comm: word_count-pthr 4.0.0-lego-ys+ #476

03/11 Mon

Debug victim cache

Morning. Today I will continue debugging victim and clflush, running with MT phoenix+2GB, seq+4GB. Sounds good.

Digging into yesterday’s 21st run log. The warning comes from victim_alloc_slowpath: the allocation aborts after a 10-second timeout. And interestingly, a lot of threads abort. (The case is: the pset is full, so pcache_alloc tries to evict one line to the victim cache. But the victim cache is full as well, so it needs to evict a victim cache line too. Somehow this does not proceed as planned.) I guess a deadlock happens somewhere.
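The cascade in parentheses can be sketched as below; all names and return codes are invented stand-ins. The point is that allocation progress depends transitively on the flushd having flushed at least one victim: if nothing is flushed, the whole chain stalls, which looks exactly like a deadlock.

```c
enum { OK = 0, EAGAIN = -1 };

/* Only a flushed victim may be evicted from the victim cache. */
static int victim_evict_one(int nr_flushed)
{
	return nr_flushed > 0 ? OK : EAGAIN;
}

/* pcache allocation slow path: a full pset forces an eviction into
 * the victim cache; a full victim cache forces a victim eviction
 * first. No flushed victim means no progress anywhere. */
static int pcache_alloc_slowpath(int pset_full, int victim_full,
				 int nr_flushed)
{
	if (!pset_full)
		return OK;
	if (victim_full && victim_evict_one(nr_flushed) == EAGAIN)
		return EAGAIN;	/* stalls until flushd makes progress */
	return OK;		/* evicted into the victim cache */
}
```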

[ 1682.040428] WARNING: CPU: 7 PID: 34 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1682.161063] WARNING: CPU: 19 PID: 46 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1686.602779] WARNING: CPU: 10 PID: 37 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1687.384837] WARNING: CPU: 3 PID: 53 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1687.505474] WARNING: CPU: 21 PID: 48 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1687.737386] WARNING: CPU: 16 PID: 43 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1687.859063] WARNING: CPU: 4 PID: 54 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1688.034819] WARNING: CPU: 6 PID: 56 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1688.210574] WARNING: CPU: 14 PID: 41 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1688.488246] WARNING: CPU: 5 PID: 55 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1689.598935] WARNING: CPU: 22 PID: 49 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1689.953565] WARNING: CPU: 0 PID: 51 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1691.740234] WARNING: CPU: 13 PID: 40 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1691.861911] WARNING: CPU: 1 PID: 52 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0
[ 1791.554552] WARNING: CPU: 11 PID: 38 at managers/processor/pcache/victim.c:447 victim_prepare_insert+0x322/0x4b0

1st run, MT+2GB. Victim allocation aborts as predicted. Somehow I have already forgotten how the code is designed; I need to do a detailed reread.

Along the way, fixed a bug in the eviction code: handle a failed evict_line properly. If the eviction mechanism fails, we need to undo what the algorithm part has done. This is also related to yesterday’s big idea: always do proper cleanup. Many thanks to the pcache free checking, which helped me find this bug.

Less is more. I printed too much useless info when pcache_alloc or victim_alloc fails. I removed all the dump_pset calls from the failure path, which gives me much cleaner messages to debug.

Hmm, it is really weird. I dump all victims once the allocation times out. You can see that none of the victims are Flushed, which means none of them can be evicted. Take a look at the stats. Hmm, I probably should not use per-cpu counters for this??
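On the per-cpu counter question: I'm assuming the usual layout, where each CPU bumps its own slot locklessly and the dump path sums all slots, so the printed totals are only approximate while other CPUs are still updating. A sketch with invented names:

```c
#define NR_CPUS 24

/* Each CPU increments its own slot without locking; the dump path
 * sums every slot. Fast on the hot path, but the sum can be slightly
 * stale if read while other CPUs are mid-increment. */
static long pcpu_nr_clflush[NR_CPUS];

static void inc_clflush(int cpu)
{
	pcpu_nr_clflush[cpu]++;
}

static long sum_clflush(void)
{
	long sum = 0;

	for (int i = 0; i < NR_CPUS; i++)
		sum += pcpu_nr_clflush[i];
	return sum;
}
```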

...
[ 4751.460819]   --   Start Dump Victim Cache     --
[ 4751.518022]   --   CPU19 [word_count-pthr][pid=46, tgid=32] --
[ 4751.587706] victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800440
[ 4751.747872]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff20011000
[ 4751.827955]     pset:ffff88207f800440 set_idx: 17 nr_lru:8
[ 4751.893478]
[ 4751.911159] victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f8003c0
[ 4752.071326]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff2000f000
[ 4752.428060]     pset:ffff88207f8003c0 set_idx: 15 nr_lru:8
[ 4752.630868]
[ 4752.931441] victim:ffff88207ff71090 index:2 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800540
[ 4753.370339]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff20015000
[ 4753.450422]     pset:ffff88207f800540 set_idx: 21 nr_lru:8
[ 4753.515945]
[ 4753.533627] victim:ffff88207ff710d8 index:3 refcount:3 nr_fill:1 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207fbdff40
[ 4753.693792]     hit[0] owner: [word_count-pthr][32] addr: 0x7fffbf7fd000
[ 4753.773875]     pset:ffff88207fbdff40 set_idx: 63485 nr_lru:7
[ 4753.842518]
[ 4753.860199] victim:ffff88207ff71120 index:4 refcount:3 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800500
[ 4754.020367]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff20014000
[ 4754.100449]     pset:ffff88207f800500 set_idx: 20 nr_lru:8
[ 4754.165971]
[ 4754.183653] victim:ffff88207ff71168 index:5 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800480
[ 4754.343819]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff30012000
[ 4754.423902]     pset:ffff88207f800480 set_idx: 18 nr_lru:8
[ 4754.489426]
[ 4754.507106] victim:ffff88207ff711b0 index:6 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f8004c0
[ 4754.808718]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff30013000
[ 4754.888802]     pset:ffff88207f8004c0 set_idx: 19 nr_lru:8
[ 4754.954325]
[ 4754.972006] victim:ffff88207ff711f8 index:7 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800400
[ 4755.132172]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff20010000
[ 4755.212255]     pset:ffff88207f800400 set_idx: 16 nr_lru:8
[ 4755.277778]
[ 4755.295458]   --   End Dump Victim Cache       --

...

[ 4757.948641] nr_pgfault: 313898
[ 4757.983067] nr_clflush: 488
[ 4758.016347] nr_pgfault_wp: 0
[ 4758.050669] nr_pgfault_wp_cow: 0
[ 4758.089151] nr_pgfault_wp_reuse: 0
[ 4758.129713] nr_pgfault_due_to_concurrent_eviction: 0
[ 4758.188995] nr_pcache_fill_from_memory: 313833
[ 4758.242038] nr_pcache_fill_from_victim: 54
[ 4758.290919] nr_pcache_eviction_triggered: 243280263
[ 4758.349161] nr_pcache_eviction_eagain: 243279763
[ 4758.404283] nr_pcache_eviction_succeed: 488
[ 4758.454207] nr_victim_eviction: 426
[ 4758.495807] nr_victim_prepare_insert: 500
[ 4758.543649] nr_victim_finish_insert: 488
[ 4758.590451] nr_victim_flush_submitted: 488
[ 4758.639333] nr_victim_flush_finished: 488

I counted it wrong. Below is the log. Since nr_victim_flushd_run * 8 = nr_victim_flush_finished, every run of victim_flushd has to flush all 8 victims, which implies the eviction rate is much higher than the flushd run rate. nr_pcache_fill_from_victim is 21, which means there are some successful refills, but I don’t see how that improves performance.

[  475.468489] CPU4 PID:54 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207f800000, pset_idx:0, nr_lru:7
[  475.602752] CPU3 PID:53 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207f900a00, pset_idx:16424, nr_lru:7
[  476.029145] CPU5 PID:55 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207fbdff40, pset_idx:63485, nr_lru:7
[  476.169542] CPU9 PID:36 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207f900000, pset_idx:16384, nr_lru:7
[  477.360322] CPU1 PID:52 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207fbfff80, pset_idx:65534, nr_lru:7
[  479.206291] CPU18 PID:45 Abort victim alloc (10010ms) nr_usable_victims: 8 req from pset:ffff88207fb00000, pset_idx:49152, nr_lru:7

[  475.743150]   --   Start Dump Victim Cache     --
[  475.800350]   --   CPU4 [word_count-pthr][pid=54, tgid=32] --
[  475.868989] victim:ffff88207ff71000 index:0 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800a80
[  476.309940]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff3002a000
[  476.390020]     pset:ffff88207f800a80 set_idx: 42 nr_lru:8
[  476.455538]
[  476.473218] victim:ffff88207ff71048 index:1 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800bc0
[  476.633376]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff4002f000
[  476.713453]     pset:ffff88207f800bc0 set_idx: 47 nr_lru:8
[  476.778972]
[  476.796652] victim:ffff88207ff71090 index:2 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800b80
[  476.956809]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff3002e000
[  477.036889]     pset:ffff88207f800b80 set_idx: 46 nr_lru:8
[  477.102406]
[  477.120086] victim:ffff88207ff710d8 index:3 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800a00
[  477.280245]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff30028000
[  477.500721]     pset:ffff88207f800a00 set_idx: 40 nr_lru:8
[  477.566239]
[  477.583918] victim:ffff88207ff71120 index:4 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800b40
[  477.744077]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff3002d000
[  477.824155]     pset:ffff88207f800b40 set_idx: 45 nr_lru:8
[  477.889673]
[  477.907353] victim:ffff88207ff71168 index:5 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800b00
[  478.067511]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff3002c000
[  478.147590]     pset:ffff88207f800b00 set_idx: 44 nr_lru:8
[  478.213109]
[  478.230788] victim:ffff88207ff711b0 index:6 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800a40
[  478.390946]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff30029000
[  478.471024]     pset:ffff88207f800a40 set_idx: 41 nr_lru:8
[  478.536542]
[  478.554222] victim:ffff88207ff711f8 index:7 refcount:2 nr_fill:0 locked:0 flags:(allocated|usable|hasdata) pcm:          (null) pset:ffff88207f800ac0
[  478.714380]     hit[0] owner: [word_count-pthr][32] addr: 0x7fff3002b000
[  478.794458]     pset:ffff88207f800ac0 set_idx: 43 nr_lru:8
[  478.859977]
[  478.877657]   --   End Dump Victim Cache       --

[  480.324070] nr_pgfault: 372353
[  480.358494] nr_clflush: 336
[  480.391774] nr_pgfault_wp: 0
[  480.426093] nr_pgfault_wp_cow: 0
[  480.464573] nr_pgfault_wp_reuse: 0
[  480.505132] nr_pgfault_due_to_concurrent_eviction: 0
[  480.564410] nr_pcache_fill_from_memory: 372326
[  480.617450] nr_pcache_fill_from_victim: 21
[  480.666330] nr_pcache_eviction_triggered: 178320088
[  480.724569] nr_pcache_eviction_eagain: 178319746
[  480.779687] nr_pcache_eviction_succeed: 336
[  480.829606] nr_victim_eviction_triggered: 20589049
[  480.886805] nr_victim_eviction_eagain: 20588741
[  480.940885] nr_victim_eviction_succeed: 308
[  480.990804] nr_victim_prepare_insert: 342
[  481.038643] nr_victim_finish_insert: 336
[  481.085442] nr_victim_flush_submitted: 336
[  481.134321] nr_victim_flush_finished: 336
[  481.182161] nr_victim_flushd_run: 42

03/11 Sun

Fix bug in __unhash_process()

[Summary]: a bug caused by laziness. When a fork happens, the new thread is added to its parent’s thread_group list. However, we forgot to remove it when the new thread exits. Thus, the field in the parent’s thread_group list still points to a freed page. To make it worse, the freed page got allocated again; in our case, it was reused as a pgtable page. So when the parent touches that field, it silently corrupts the pgtable. This bug is fixed by this commit: 64d43fc.

Got something going on. Huh.

Anyway, picking up what was left last night.

8th run,

[  426.595911] SYSC_mmap(cpu5): ret_addr:0x7ffefbeac000

pte page got allocated
[  426.653216]     pmd is none index 0x1e3 line 567 from_addr 0x7ffefc6acd90
[  426.734334] __pte_alloc(): for addr: 0x7ffefc6acd90 pte_index: ac
[  426.807132]     pte is none index 0x38 line 574 from_addr 0x7ffefc6acd90
[  427.304148]     pte is none index 0x38 line 576 from_addr 0x7ffefc6acd90

this addr seems fine
[  427.382251]     pte is none index 0x38 line 567 from_addr 0x7ffefc6abe78
[  427.462329]     pte is none index 0x38 line 574 from_addr 0x7ffefc6abe78
[  427.644439]     pte is none index 0x38 line 576 from_addr 0x7ffefc6abe78

Something happened in between that corrupted the pgtable:
[  427.722547] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  427.858779] line: 567 from_addr: 0x6fc6d8 pte.cont: 0xffff88207e8c31c0

[  427.938858] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  428.075095] line: 574 from_addr: 0x6fc6d8 pte.cont: 0xffff88207e8c31c0

9th run, found it actually created another thread, which then exited. The corruption happened after pid 33 exited. Bang, it must be something wrong in exit().

wuklab13 0311-4
[  813.127325] CPU6 pid:33     pmd is none index 0x1e3 line 586 from_addr 0x4b0db0
[  813.214683] CPU5 pid:32     pmd is none index 0x1e3 line 593 from_addr 0x6f4768
[  813.302042] CPU6 pid:33     pmd is none index 0x1e3 line 593 from_addr 0x4b0db0
[  813.397836] CPU5 pid:32     pmd is none index 0x1e3 line 595 from_addr 0x6f4768
[  813.593364] CPU6 pid:33     pmd is none index 0x1e3 line 595 from_addr 0x4b0db0

[  813.678751] do_exit() pid:33,tgid:32 code:0x0

[  814.474321] CPU5 pid:32     pmd is none index 0x1e3 line 567 from_addr 0x7ffefc6acd90
[  814.567918] CPU5 pid:32     pmd is none index 0x1e3 line 575 from_addr 0x7ffefc6acd90
[  814.661516] CPU5 pid:32     pmd is none index 0x1e3 line 583 from_addr 0x7ffefc6acd90
[  814.755115] CPU5 pid:32     pmd is none index 0x1e3 line 586 from_addr 0x7ffefc6acd90
[  814.848714] __pte_alloc(): for addr: 0x7ffefc6acd90 pte_index: ac
[  814.921511] CPU5 pid:32     pte is none index 0x38 line 593 from_addr 0x7ffefc6acd90
[  815.125249] CPU5 pid:32     pte is none index 0x38 line 595 from_addr 0x7ffefc6acd90
[  815.215833] After pcache_handle_fault
[  815.259511] CPU5 pid:32     pte is none index 0x38 line 726 from_addr 0x7ffefc6acd90

[  815.352071] CPU5 pid:32     pte is none index 0x38 line 567 from_addr 0x7ffefc6abe78
[  815.444627] CPU5 pid:32     pte is none index 0x38 line 575 from_addr 0x7ffefc6abe78
[  815.537186] CPU5 pid:32     pte is none index 0x38 line 583 from_addr 0x7ffefc6abe78
[  815.629744] CPU5 pid:32     pte is none index 0x38 line 586 from_addr 0x7ffefc6abe78
[  815.722303] CPU5 pid:32     pte is none index 0x38 line 593 from_addr 0x7ffefc6abe78
[  815.916890] CPU5 pid:32     pte is none index 0x38 line 595 from_addr 0x7ffefc6abe78
[  816.007471] After pcache_handle_fault
[  816.051151] CPU5 pid:32     pte is none index 0x38 line 726 from_addr 0x7ffefc6abe78

[  816.143715] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  816.279946] do_exit() pid:34,tgid:32 code:0x0
[  816.331945] CPU5 pid:32 line: 567 from_addr: 0x6fc6d8 pte.cont: 0xffff88207e8c31c0

10th run, actually 2 threads are created. When pid 33 exits, everything stays okay. But after the fork of pid 34, it goes wrong:

wuklab13 0311-8
[  609.490893] do_exit() pid:33,tgid:32 code:0x0

[  609.542894] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 401 from_addr 0x0
[  609.640661] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 443 from_addr 0x0
[  609.738429] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 465 from_addr 0x0
[  609.836197] exit_mm:378 mm->users 2 mm->count 1
[  609.891320] exit_mm:380 mm->users 1 mm->count 1
[  609.946445] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 468 from_addr 0x0
[  610.044212] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 471 from_addr 0x0
[  610.141979] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 474 from_addr 0x0

[  610.239747] SYSC_mmap(cpu5): ret_addr:0x7ffefbeac000
[  610.299031] CPU6 pid:33 caller: do_exit    pmd is none index 0x1e3 line 482 from_addr 0x0
[  610.396798] CPU5 pid:32 caller: pcache_handle_fault    pmd is none index 0x1e3 line 568 from_addr 0x7ffefc6acd90
[  610.518489] CPU5 pid:32 caller: pcache_handle_fault    pmd is none index 0x1e3 line 576 from_addr 0x7ffefc6acd90
[  610.640178] CPU5 pid:32 caller: pcache_handle_fault    pmd is none index 0x1e3 line 584 from_addr 0x7ffefc6acd90
[  610.761866] CPU5 pid:32 caller: pcache_handle_fault    pmd is none index 0x1e3 line 587 from_addr 0x7ffefc6acd90
[  610.883557] __pte_alloc(): for addr: 0x7ffefc6acd90 pte_index: ac
[  610.956362] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 594 from_addr 0x7ffefc6acd90
[  611.179051] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 596 from_addr 0x7ffefc6acd90
[  611.297723] After pcache_handle_fault
[  611.341406] CPU5 pid:32 caller: do_page_fault    pte is none index 0x38 line 726 from_addr 0x7ffefc6acd90
[  611.455816] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 568 from_addr 0x7ffefc6abe78
[  611.576464] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 576 from_addr 0x7ffefc6abe78
[  611.697113] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 584 from_addr 0x7ffefc6abe78
[  611.817762] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 587 from_addr 0x7ffefc6abe78
[  611.938412] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 594 from_addr 0x7ffefc6abe78
[  612.161103] CPU5 pid:32 caller: pcache_handle_fault    pte is none index 0x38 line 596 from_addr 0x7ffefc6abe78
[  612.279778] After pcache_handle_fault
[  612.323461] CPU5 pid:32 caller: do_page_fault    pte is none index 0x38 line 726 from_addr 0x7ffefc6abe78

[  612.437875] do_fork: current: 32 new: 34

[  612.484676] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  612.620924] do_exit() pid:34,tgid:32 code:0x0
[  612.672928] CPU5 pid:32 caller: pcache_handle_faultline: 568 from_addr: 0x6fc6d8 pte.cont: 0xffff88207e8c31c0

[  612.793577] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  612.929828] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  613.066078] CPU7 pid:34 caller: do_exitline: 401 from_addr: 0x0 pte.cont: 0xffff88207e8c31c0

11th run, found it originates from copy_process(). Good.

[  869.591729] CPU5 pid:32 caller: do_fork    pte is none index 0x38 line 886 from_addr 0x0

[  869.688449] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  869.824681] CPU5 pid:32 caller: do_fork line: 894 from_addr: 0x0 pte.cont: 0xffff88207e8c31c0

12th run, found the operation that corrupts the pgtable:

[ 1099.974106] CPU5 pid:32 caller: copy_process    pte is none index 0x38 line 897 from_addr 0x0
[ 1100.076032] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[ 1100.212282] CPU5 pid:32 caller: copy_process line: 902 from_addr: 0x0 pte.cont: 0xffff88207e8c31c0

896         if (current->tgid == 32)
897                 jasmine(0, __LINE__, __func__);
898
899                         list_add_tail(&p->thread_group,
900                                           &p->group_leader->thread_group);
901         if (current->tgid == 32)
902                 jasmine(0, __LINE__, __func__);

13th run, interesting: the list_add_tail writes into the pgtable. pte.cont = 0xffff88207e8c31c0, p->thread_group: 0xffff88207e8c31c0.

[  916.269942] CPU5 pid:32 caller: copy_process    pte is none index 0x38 line 898 from_addr 0x0

[  916.371863] p: ffff88207e8c3000 p->group_leader: ffff88107e190000(32) p->thread_group: ffff88207e8c31c0 leader->thread_grou: ffff88107e1901c0

[  916.523705] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[  916.659947] CPU5 pid:32 caller: copy_process line: 906 from_addr: 0x0 pte.cont: 0xffff88207e8c31c0

[  916.769148] p: ffff88207e8c3000 p->group_leader: ffff88107e190000(32) p->thread_group: ffff88207e8c31c0 leader->thread_grou: ffff88107e1901c0

14th run, got a log like this. Clearly, the pte got overwritten with the value of p->thread_group. But the leader’s pointer is correct. Weird, going to dig deeper.

                p: ffff88207e8c3000 p->group_leader: ffff88107e189000(32)

                p->thread_group:        ffff88207e8c31c0
                leader->thread_group:   ffff88107e1891c0

                pte page:               ffff88207e8b5000
                pte:                    ffff88207e8b51c0

                pte.cont:               ffff88207e8c31c0

15th run, found the bug.

wuklab13 0311-15
[ 1474.477687] dup_task_struct(): current: 32 new: ffff88207e8b5000
..
while pid 33 exits,
so the ffff88207e8b5000 page is freed

but it is allocated again by pte_alloc
[ 1481.420200] __pte_alloc():CPU5 for addr: 0x7ffefc6acd90 pte_index: ac new pte page: ffff88207e8b5000

However, we forgot to remove it from group_leader's thread_group

[ 1485.895938]
                p: ffff88207e8c3000
                p->group_leader: ffff88107e19b000(32)

                p->thread_group:        ffff88207e8c31c0
                leader->thread_group:   ffff88107e19b1c0

[ 1486.047784]
                tg->next:               ffff88207e8c31c8
                tg->prev:               ffff88207e8c31c0
                leader->tg->next        ffff88107e19b1c8
                leader->tg->prev        ffff88107e19b1c0

[ 1486.191311]  next                    ffff88107e19b1c0
                prev                    ffff88207e8b51c0
                next                    ffff88107e19b1c0

[ 1486.276594] CPU5 pid:32 caller: __list_add    pte is none index 0x38 line 61 from_addr 0x0 page: 0xffff88207e8b5000
[ 1486.401399] CPU5 pid:32 caller: __list_add    pte is none index 0x38 line 65 from_addr 0x0 page: 0xffff88207e8b5000
[ 1486.526203] CPU5 pid:32 caller: __list_add    pte is none index 0x38 line 69 from_addr 0x0 page: 0xffff88207e8b5000
[ 1486.651010] CPU5 pid:32 caller: __list_add    pte is none index 0x38 line 73 from_addr 0x0 page: 0xffff88207e8b5000

[ 1486.775814] pte:ffff88207e8b51c0 pfn:0x8207e8c3 flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[ 1486.912060] CPU5 pid:32 caller: __list_add line: 77 from_addr: 0x0 pte.cont: 0xffff88207e8c31c0

16th run: damn, after patching __unhash_process(), it finally works. Going to work out, see you tonight.

victim report error

17th run. The phoenix program has a bug itself: it is not able to run with the 4GB dataset. So try it with the 2GB dataset. Uuh, the log is too long. __put_victim reports a victim that has wrong flags. Going to disable the evict log and try again.

18th run. Happened to run seq with 100MB… It actually half finished, but phoenix’s printf output has funny chars. I guess memory is corrupted. The log shows the crash is in ib_mad_completion.

[ 2244.018806] Processor: Processor manager is running.
[ 2246.394568] STDOUT: ---[
envp[0] HOME=/

]---
[ 2246.447719] STDOUT: ---[
envp[1] TERM=linux

]---
[ 2246.507003] STDOUT: ---[
argv[0] /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-seq

]---
[ 2246.618289] STDOUT: ---[
argv[1] /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_100MB.txt

]---
[ 2258.805633] STDOUT: ---[
Word-Count-Seq: Computation Completed 12.46633 sec

]---
[ 2258.923180] SYSC_close(): [4] -> [/proc/meminfo]
[ 2258.995743] STDOUT: ---[
Use len is 123748
[ 2263.484774] STDOUT: ---[
THE: 1115050
]---
[ 2263.666785] STDOUT: ---[
OF: 615296
]---
[ 2266.103660] STDOUT: ---[
AND: 545303 (a lot funny chars, deleted.)
]---
[ 2267.016837] Code: [ 2267.038680] STDOUT: ---[
TO: 475179
+>ÕþÚéØ^G§<87>k<80>z^T<86>ruJ·¿»<9e>éÞíÑrÜÕ^Wå^W*^_{(Ê?Rùaéö÷8í<91>Üè<8f>ò¿i^?è4<94>ײɵ^V¿«ëP]íïh^GÊë<98>^T×Qp¹O®ï^^?^Aí<91>ÙvÝBy^_éiwP^r<97>ëùïß]£ß­<98><81>ø<85>ÎEy^Yå^?Vùº^YÞõË]r5Éð^^'<92>É]^]P^Çi»z:Ô^S
®e<8a>+<8a>®±àEÕÎ,ðÒâ3Á_^P_^H^[|¸®ásíF¿m<95><9d>?<82>ò:¾Þõ3Ê×Tü®
]---
[ 2263.339165] BUG: unable to handle kernel paging request at ffffffffffff8100
[ 2263.422369] IP: [<ffffffffffff8100>] 0xffffffffffff8100
[ 2263.570058] PGD 1140067 PUD 1142067 PMD 0
[ 2263.618942] Oops: 0010 [#1] SMP PROCESSOR
[ 2264.705811] CPU: 0 PID: 27 Comm: ib_mad_completi 4.0.0-lego-ys+ #408
[ 2264.781736] RIP: 0010:[<ffffffffffff8100>]  [<ffffffffffff8100>] 0xffffffffffff8100
[ 2264.873262] RSP: 0000:ffff88107efabc90  EFLAGS: 00010046
[ 2264.936705] RAX: 5636000000000098 RBX: db5affffffffffff RCX: 0000000000000001
[ 2265.021990] RDX: ffff88107efabd38 RSI: 0000000000000000 RDI: 4460ffffffff8114
[ 2265.107277] RBP: ffff88107efabce0 R08: 000000000000001f R09: ffff88107efa43c0
[ 2265.192561] R10: ffff88107efabe68 R11: 0000000000000001 R12: ac02000004ecbdbd
[ 2265.277847] R13: 0000000000000000 R14: ffff88107efa4228 R15: ffff88107e1ab000
[ 2265.363133] FS:  0000000000000000(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
[ 2265.459858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2265.528503] CR2: ffffffffffff8100 CR3: 000000000113d000 CR4: 00000000000406b0
[ 2265.613789] Stack:
[ 2265.637710] ffffffff810151a7 0000000000000082 ffff88107fc04980 0000000000000000
[ 2265.725075] ffff88107efabcc8 ffff88107fc04980 0000000000000000 0000000000000000
[ 2265.812441] ffff88107efa4228 ffff88107e1ab000 ffff88107efabcf8 ffffffff81016e17
[ 2265.899806] 000000007efabe20 ffff88107efabd20 ffffffff810066f4 ffffffff81072f20
[ 2265.987172] ffff88107fc05e00 ffff88107efa4000 ffff88107efabe08 ffffffff8100e4aa
[ 2266.074538] Call Trace:
[ 2266.206626] <TSK>
[ 2266.229507] [<ffffffff810151a7>] ? update_wall_time+0x47/0x6b0
[ 2266.299192] [<ffffffff81016e17>] tick_handle_periodic+0x67/0x70
[ 2266.369916] [<ffffffff810066f4>] apic_timer_interrupt+0x54/0x90
[ 2266.440641] [<ffffffff8100e4aa>] smp__apic_timer_interrupt+0x6a/0x70
[ 2266.516565] [<ffffffff810663b8>] ? __schedule+0xf8/0x1e0
[ 2266.580010] [<ffffffff810664b3>] schedule+0x13/0x30
[ 2266.638254] [<ffffffff81058c97>] ib_mad_completion_handler+0x2b7/0x860
[ 2266.716258] [<ffffffff810589e0>] ? ib_mad_send_done_handler.isra.22+0x1d0/0x1d0
[ 2266.803624] [<ffffffff81020376>] kthread+0xf6/0x110
[ 2266.861867] [<ffffffff81020280>] ? __kthread_parkme+0x70/0x70
[ 2266.930512] [<ffffffff8100e732>] ret_from_fork+0x22/0x30
[ 2266.993955] <EOT>

19th, try seq+100MB again. Well, it succeeds. I guess I started S too late last time, so that thread had issues. We run in 12.3 sec, while Linux runs in 9.7 sec.

20th, try seq+4GB data. Linux runs 314.4 sec. Lego runs 403 sec. But Lego has some clflush error messages. I don’t know why actually.

[  794.604628] Processor: Processor manager is running.
[  796.884884] STDOUT: ---[
envp[0] HOME=/

]---
[  796.938032] STDOUT: ---[
envp[1] TERM=linux

]---
[  796.997312] STDOUT: ---[
argv[0] /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-seq

]---
[  797.108596] STDOUT: ---[
argv[1] /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_4GB.txt

]---
[  980.640200] __clflush_one(): EFAULT:bad address
[  980.692315] __clflush_one(): EFAULT:bad address
[  980.746397] __clflush_one(): EFAULT:bad address
[  980.800478] __clflush_one(): EFAULT:bad address
[  980.854559] __clflush_one(): EFAULT:bad address
[  980.908642] __clflush_one(): EFAULT:bad address
[  980.962723] __clflush_one(): EFAULT:bad address
[  981.016804] __clflush_one(): EFAULT:bad address
[  981.070886] __clflush_one(): EFAULT:bad address
[  981.124968] __clflush_one(): EFAULT:bad address
[  981.179048] __clflush_one(): EFAULT:bad address
[  981.233129] __clflush_one(): EFAULT:bad address
[  981.287211] __clflush_one(): EFAULT:bad address
[  981.341293] __clflush_one(): EFAULT:bad address
[  981.395375] __clflush_one(): EFAULT:bad address
[  981.449456] __clflush_one(): EFAULT:bad address
[  981.503538] __clflush_one(): EFAULT:bad address
[  981.557619] __clflush_one(): EFAULT:bad address
[  981.611702] __clflush_one(): EFAULT:bad address
[  981.665782] __clflush_one(): EFAULT:bad address
[  981.719863] __clflush_one(): EFAULT:bad address
[  981.773945] __clflush_one(): EFAULT:bad address
[  981.828026] __clflush_one(): EFAULT:bad address
[  981.882108] __clflush_one(): EFAULT:bad address
[  981.936188] __clflush_one(): EFAULT:bad address
[  981.990271] __clflush_one(): EFAULT:bad address
[  982.044352] __clflush_one(): EFAULT:bad address
[  982.098434] __clflush_one(): EFAULT:bad address
[  982.152515] __clflush_one(): EFAULT:bad address
[  982.206596] __clflush_one(): EFAULT:bad address
[ 1200.759741] STDOUT: ---[
Word-Count-Seq: Computation Completed 403.519401 sec

]---
...
[ 1200.989480] STDOUT: ---[
THE: 44602000
...
[ 1201.755779] do_group_exit() pid:32,tgid:32 exit_code:0x0
[ 1201.819136] do_exit() pid:32,tgid:32 code:0x0
[ 1201.872451] nr_pgfault: 1049525
[ 1201.908579] nr_pgfault_wp: 0
[ 1201.942899] nr_pgfault_wp_cow: 0
[ 1201.981380] nr_pgfault_wp_reuse: 0
[ 1202.021941] nr_pgfault_due_to_concurrent_eviction: 0
[ 1202.081223] nr_pcache_fill_from_memory: 1045393
[ 1202.135304] nr_pcache_fill_from_victim: 4132
[ 1202.186265] nr_pcache_eviction: 525230
[ 1202.230987] nr_victim_eviction: 521090

21st run. No time and energy to debug the clflush issue. I just want to run MT+2GB again. Well, victim has issues! Some warnings are triggered. Log is wuklab13:~/ys/0311-22. Continue tomorrow! Good night world. (Such a lonely PhD.)

03/10 Sat

Running python hello world. Tried to make kmalloc use buddy directly.

put_pcache in pcache_zap_pte

So this time, python kept running for a long time. But P crashed the first time eviction was triggered.

Below is the log from the S side; those libraries do not exist, so these logs are fine:

S:
[Mar10 10:39] handle_access_request /etc/ld.so.preload 4, -2
[Mar10 10:44] local_file_open : Cannot open required file [/usr/lib64/python2.7/site.so].
[  +0.352839] local_file_open : Cannot open required file [/usr/lib64/python2.7/sitemodule.so].
[ +22.254465] local_file_open : Cannot open required file [/usr/lib64/python2.7/os.so].
[  +0.350759] local_file_open : Cannot open required file [/usr/lib64/python2.7/osmodule.so].
[Mar10 10:45] local_file_open : Cannot open required file [/usr/lib64/python2.7/posixpath.so].
[  +0.358045] local_file_open : Cannot open required file [/usr/lib64/python2.7/posixpathmodule.so].
[ +13.421033] local_file_open : Cannot open required file [/usr/lib64/python2.7/stat.so].
[  +0.352838] local_file_open : Cannot open required file [/usr/lib64/python2.7/statmodule.so].
[Mar10 10:46] local_file_open : Cannot open required file [/usr/lib64/python2.7/genericpath.so].
[  +0.360126] local_file_open : Cannot open required file [/usr/lib64/python2.7/genericpathmodule.so].
[ +11.582165] local_file_open : Cannot open required file [/usr/lib64/python2.7/warnings.so].
[  +0.357003] local_file_open : Cannot open required file [/usr/lib64/python2.7/warningsmodule.so].
[ +11.989828] local_file_open : Cannot open required file [/usr/lib64/python2.7/linecache.so].
[  +0.358043] local_file_open : Cannot open required file [/usr/lib64/python2.7/linecachemodule.so].
[Mar10 10:47] local_file_open : Cannot open required file [/usr/lib64/python2.7/types.so].
[  +0.353879] local_file_open : Cannot open required file [/usr/lib64/python2.7/typesmodule.so].

Weird bug on P’s side; seems like the pcm returned by evict_find_line has an issue. I’m trying to debug what is going on with this set.

wuklab13 0310-2
[ 1046.880649] SYSC_read() cpu(5) tsk(32/32/python) user-ip:0x7ffff6e117e0
[ 1046.959692]     fd: 8, buf: 00007ffff7ffb000, count: 4096
[ 1048.726624] pcache_evict_line(): pset: ffff88207f9ffec0, for uva: 0x7ffff7ffb000
[ 1048.813053] ------------[ cut here ]------------
[ 1048.868174] BUG: failure at ./include/processor/pcache.h:284/pcache_meta_to_pcache_set()!
[ 1048.965937] Kernel Panic - not syncing: BUG!
[ 1049.016898] CPU: 5 PID: 32 Comm: python 4.0.0-lego-ys+ #347
[ 1049.083460] Stack:
[ 1049.107380] ffff88107e18fca8 ffffffff81026f1c 0000000000000008 ffff88107e18fcb8
[ 1049.194743] ffff88107e18fc70 0000000021475542 0000000000000000 0000000000000000
[ 1049.282107] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1049.369468] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1049.456832] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1049.544193] Call Trace:
[ 1049.573315] <TSK>
[ 1049.596195] [<ffffffff81026f28>] panic+0xc2/0xeb
[ 1049.651318] [<ffffffff8101c3fc>] ? task_tick_rt+0x2c/0xd0
[ 1049.715799] [<ffffffff81019a65>] ? scheduler_tick+0x55/0x60
[ 1049.782360] [<ffffffff81017035>] ? tick_handle_periodic+0x45/0x70
[ 1049.855163] [<ffffffff81006764>] ? apic_timer_interrupt+0x54/0x90
[ 1049.927966] [<ffffffff8101c3fc>] ? task_tick_rt+0x2c/0xd0
[ 1049.992447] [<ffffffff81019a65>] ? scheduler_tick+0x55/0x60
[ 1050.059009] [<ffffffff81017035>] ? tick_handle_periodic+0x45/0x70
[ 1050.131812] [<ffffffff8103c41a>] ? put_dec+0x1a/0x80
[ 1050.191093] [<ffffffff81006764>] ? apic_timer_interrupt+0x54/0x90
[ 1050.263895] [<ffffffff8100e56a>] ? smp__apic_timer_interrupt+0x6a/0x70
[ 1050.341897] [<ffffffff81012ded>] ? printk+0x11d/0x1b0
[ 1050.402219] [<ffffffff810340c5>] dump_pcache_meta+0xc5/0xd0
[ 1050.468782] [<ffffffff81034588>] pcache_evict_line+0x158/0x220
[ 1050.538463] [<ffffffff81030f5e>] pcache_alloc+0x22e/0x2f0
[ 1050.602945] [<ffffffff8103015a>] common_do_fill_page+0x2a/0x430
[ 1050.673668] [<ffffffff8102fb20>] ? pcache_meta_to_kva+0x30/0x30
[ 1050.744389] [<ffffffff81030702>] pcache_handle_fault+0x1a2/0x6c0
[ 1050.816152] [<ffffffff810103d2>] do_page_fault+0xa2/0x1a0
[ 1050.880634] [<ffffffff8100db9f>] page_fault+0x1f/0x30
[ 1050.940955] [<ffffffff8103bb82>] ? copy_user_enhanced_fast_string+0x2/0x10
[ 1051.023118] [<ffffffff81038423>] ? normal_p2m_read+0x233/0x330
[ 1051.092800] [<ffffffff810363ce>] sys_read+0x9e/0x160
[ 1051.152081] [<ffffffff810268d0>] ? strace_enter_default+0x30/0x40
[ 1051.224884] [<ffffffff8100e935>] do_syscall_64+0x45/0xd0
[ 1051.288326] [<ffffffff8100d82c>] entry_SYSCALL64_slow_path+0x25/0x25

Interesting. Added several debug messages. The bug is that I forgot to put_pcache when an rmap was zapped. One rmap holds one refcount (effectively one process), so when an rmap is zapped, we should decrease the refcount. I had already done so in pcache_remove_rmap and pcache_move_pte. But damn, forgot this one. I remember this code was written before fork+pcache, so I didn’t have the big picture at that time. A multithreaded system plus background reclaim really demands rigorous use of refcounts and locks.

[ 1418.038411] CPU5 PID32 sys_read+0x0/0xa0
[ 1418.085227] pcache_evict_line(): pset: ffff88207f9ffec0, for uva: 0x7ffff7ffb000
[ 1418.173617] pset:ffff88207f9ffec0 set_idx: 32763 nr_lru:8
[ 1418.238105] pcache:ffff8801801ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880107ffb000
[ 1418.351476] pcache:ffff8801805ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880117ffb000
[ 1418.464847] pcache:ffff8801809ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880127ffb000
[ 1418.578220] pcache:ffff880180dffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880137ffb000
[ 1418.691591] pcache:ffff8801811ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880147ffb000
[ 1418.804963] pcache:ffff8801815ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880157ffb000
[ 1418.918334] pcache:ffff8801819ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880167ffb000
[ 1419.031706] pcache:ffff880181dffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880177ffb000
[ 1419.145077] After dump pset
[ 1419.176280] pcache:ffff8801801ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880107ffb000
[ 1419.289652] pcache dumped because: evict_find_line_lru
[ 1419.351018] pcache:ffff8801805ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880117ffb000
[ 1419.464389] pcache dumped because: evict_find_line_lru
[ 1419.525757] pcache:ffff8801809ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880127ffb000
[ 1419.639127] pcache dumped because: evict_find_line_lru
[ 1419.700494] pcache:ffff880180dffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880137ffb000
[ 1419.813865] pcache dumped because: evict_find_line_lru
[ 1419.875231] pcache:ffff8801811ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880147ffb000
[ 1419.988604] pcache dumped because: evict_find_line_lru
[ 1420.049969] pcache:ffff8801815ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880157ffb000
[ 1420.163341] pcache dumped because: evict_find_line_lru
[ 1420.224708] pcache:ffff8801819ffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880167ffb000
[ 1420.338079] pcache dumped because: evict_find_line_lru
[ 1420.399445] pcache:ffff880181dffec0 mapcount:0 refcount:1 flags:(allocated|usable) kva: ffff880177ffb000
[ 1420.512817] pcache dumped because: evict_find_line_lru
[ 1420.574183] evict_find_line_lru(): pcm: ffff88207f9ffea8
[ 1420.637631] ------------[ cut here ]------------
[ 1420.692756] BUG: failure at ./include/processor/pcache.h:340/pcache_meta_to_kva()!
[ 1420.783245] Kernel Panic - not syncing: BUG!
[ 1420.834210] CPU: 5 PID: 32 Comm: python 4.0.0-lego-ys+ #349
[ 1420.900777] Stack:

python hello world runs to the end

Glad to say, python hello world finished, even with some missing syscalls, especially the stdin stuff, so the string is not actually printed out. Log is wuklab13:~/ys/0310-4.

[ 3149.540308] CPU5 PID32 sys_ioctl+0x0/0x10
[ 3149.588144] CPU5 PID32 sys_ioctl+0x0/0x10
[ 3149.635982] CPU5 PID32 sys_write+0x0/0xa0
[ 3149.683818] STDOUT: ---[
>>>
]---
[ 3149.726456] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff6d9aeb0 flags:0x150
[ 3149.926247] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff6d9aeb0 flags:0x150 ret:0(OKAY)
[ 3150.033464] CPU5 PID32 sys_newfstat+0x0/0x10
[ 3150.084420] CPU5 PID32 sys_ioctl+0x0/0x10
[ 3150.132256] strace__mmap cpu5 addr=0x0, len=0x1000, prot(0x3)=PROT_READ|PROT_WRITE, flags(0x22)=MAP_PRIVATE|MAP_ANONYMOUS, fd=18446744073709551615( ), off=0x0
[ 3150.301772] CPU5 PID32 sys_read+0x0/0xa0
[ 3150.348562] ------------[ cut here ]------------
[ 3150.403679] WARNING: CPU: 5 PID: 32 at managers/processor/fs/stdio.c:24 stdio_file_read+0x30/0x50
[ 3150.509751] Process wants STDIN!
[ 3150.546149] CPU: 5 PID: 32 Comm: python 4.0.0-lego-ys+ #352
[ 3150.612705] Stack:
[ 3150.636624] ffff88107e18fe90 ffffffff81012b15 ffffffff811464e0 00007ffff7ffb000
[ 3150.723977] 0000000000000400 00007ffff70e5640 ffff88107e18fef0 ffffffff81012bd2
[ 3150.811331] ffffffff81079d6b ffff881000000018 ffff88107e18ff00 ffff88107e18fec0
[ 3150.898687] 0000000000000020 ffffffff810346b0 0000000000000022 ffffffff811464f0
[ 3150.986040] 00007ffff7fdf740 0000000000000000 ffff88107e18ff00 ffffffff81035ac0
[ 3151.073394] Call Trace:
[ 3151.102514] <TSK>
[ 3151.125392] [<ffffffff81012b21>] __warn.constprop.0+0x91/0xd0
[ 3151.194028] [<ffffffff81012bd2>] warn_slowpath_fmt+0x42/0x50
[ 3151.261623] [<ffffffff810346b0>] ? sweep_pset_lru+0x220/0x220
[ 3151.330259] [<ffffffff81035ac0>] stdio_file_read+0x30/0x50
[ 3151.395775] [<ffffffff810346e3>] sys_read+0x33/0xa0
[ 3151.454010] [<ffffffff8100e875>] do_syscall_64+0x45/0xd0
[ 3151.517446] [<ffffffff8100d76c>] entry_SYSCALL64_slow_path+0x25/0x25
[ 3151.593362] <EOT>
[ 3151.616240] ---[ end trace 0000000000000000 ]---
[ 3151.671360] CPU5 PID32 sys_write+0x0/0xa0
[ 3151.719194] STDOUT: ---[


]---
[ 3151.759756] CPU5 PID32 sys_close+0x0/0x140
[ 3151.808628] SYSC_close(): [3] -> [/root/ys/py_hello.py]
[ 3151.871028] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a79380 flags:0x150
[ 3152.070817] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a79380 flags:0x150 ret:0(OKAY)
[ 3152.178033] CPU5 PID32 sys_rt_sigaction+0x0/0xb0
[ 3152.234151] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a77f60 flags:0x150
[ 3152.432941] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a77f60 flags:0x150 ret:0(OKAY)
[ 3152.540242] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff73ee794 flags:0x150
[ 3152.739952] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff73ee794 flags:0x150 ret:0(OKAY)
[ 3152.847171] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff715b278 flags:0x150
[ 3153.046958] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff715b278 flags:0x150 ret:0(OKAY)
[ 3153.154179] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff6de74f0 flags:0x150
[ 3153.353965] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff6de74f0 flags:0x150 ret:0(OKAY)
[ 3153.461180] CPU5 PID32 sys_exit_group+0x0/0x10

Trying phoenix pthread again

4GB pcache, 1GB dataset.

1st run with CONFIG_STRACE on: the 1GB dataset finished, and the result is correct.

2nd run without CONFIG_STRACE: the 1GB dataset run got stuck. Two weird things:

  • it opens/closes the cpu/online file many more times than a normal Linux run
  • IB got stuck. So next I’m going to try adding a lock to ibapi, to see if it is an IB-internal deadlock issue.
wuklab13 0310-7
[  702.895936] Processor: Processor manager is running.
[  722.400159] STDOUT: ---[
envp[0] HOME=/

]---
[  722.453307] STDOUT: ---[
envp[1] TERM=linux

]---
[  722.512589] STDOUT: ---[
argv[0] /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count-pthread

]---
[  722.628036] STDOUT: ---[
argv[1] /root/ys/phoenix/phoenix-2.0/tests/word_count/word_count_datafiles/word_1GB.txt

]---
[  722.759101] STDOUT: ---[
Wordcount: Running...
]---
[  722.819406] STDOUT: ---[


]---
[  722.860139] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  722.940653] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.011483] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.084287] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.157090] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.229894] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.302698] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.375502] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.448306] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.521111] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.593914] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.666718] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.739522] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.812326] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  723.885130] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  766.701260] ibapi_send_reply() polling timeout (30010 ms), caller: net_send_reply_timeout+0x11b/0x1ee
[  766.809538]  net_send_reply_timeout() caller: __pcache_do_fill_page+0x82/0x140
[  766.895863] word_count-pthr[65]: segfault at 0x7fffb5eba000 ip 000000000040249d sp 00007fffb5e9ad80 error 6
[  767.012348] CPU: 15 PID: 65 Comm: word_count-pthr 4.0.0-lego-ys+ #359
[  767.089312] RIP: 0033:[<000000000040249d>]  [<000000000040249d>] 0x40249d
[  767.170436] RSP: 002b:00007fffb5e9ad80  EFLAGS: 00010216
[  767.233879] RAX: 00007fffb5eba000 RBX: 0000000000001388 RCX: 000000000000004f
[  767.319164] RDX: 00007fffe4ea92a4 RSI: 00007fffe626fac9 RDI: 00007fffe4ea92a4
[  767.404449] RBP: 00000000007540e0 R08: 0000000000000000 R09: 0000000000014fa0
[  767.489733] R10: 0000000000427fb0 R11: 0000000000000202 R12: 0000000000012b12
[  767.575018] R13: 00007fff496ab890 R14: 00007fff48704fb0 R15: 0000000000001388
[  767.660303] FS:  00007fffb5e9b700(0000) GS:ffff88207fce0000(0000) knlGS:0000000000000000
[  767.757028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  767.825671] CR2: 00007fffb5eba000 CR3: 000000207fe3a000 CR4: 00000000000406a0
[  767.910958] get_signal(): dequeue_signr: 11, handler:          (null)
[  767.987928] get_signal(): dequeue_signr: 9, handler:          (null)

3rd run, without STRACE and with the locked ibapi: it finished and the result is correct. Runtime: 18.692936 sec.

[  555.423623] nr_pgfault: 288100
[  555.458042] nr_pgfault_wp: 0
[  555.492360] nr_pgfault_wp_cow: 0
[  555.530838] nr_pgfault_wp_reuse: 0
[  555.571396] nr_pgfault_due_to_concurrent_eviction: 0
[  555.630673] nr_pcache_fill_from_memory: 288081
[  555.683710] nr_pcache_fill_from_victim: 12
[  555.732588] nr_pcache_eviction: 494
[  555.774187] nr_victim_eviction: 474

4th run, same settings as the 3rd run, same result. But nr_pgfault differs; I guess it is due to runtime variation. Runtime: 19.12861 sec.

[  469.891700] nr_pgfault: 288119
[  469.926123] nr_pgfault_wp: 0
[  469.960444] nr_pgfault_wp_cow: 0
[  469.998924] nr_pgfault_wp_reuse: 0
[  470.039484] nr_pgfault_due_to_concurrent_eviction: 0
[  470.098764] nr_pcache_fill_from_memory: 288093
[  470.151805] nr_pcache_fill_from_victim: 12
[  470.200684] nr_pcache_eviction: 513
[  470.242285] nr_victim_eviction: 493

5th run, same as the 4th, succeeded. Runtime: 18.653879 sec.

[  313.202348] nr_pgfault: 288070
[  313.236772] nr_pgfault_wp: 0
[  313.271093] nr_pgfault_wp_cow: 0
[  313.309575] nr_pgfault_wp_reuse: 0
[  313.350139] nr_pgfault_due_to_concurrent_eviction: 0
[  313.409421] nr_pcache_fill_from_memory: 288052
[  313.462465] nr_pcache_fill_from_victim: 6
[  313.510307] nr_pcache_eviction: 446
[  313.551909] nr_victim_eviction: 432

6th run, same settings but with the 4GB dataset: crashed:

[  512.028141] Processor: Processor manager is running.
[  529.375605] STDOUT: ---[
Wordcount: Running...
]---
[  529.435906] STDOUT: ---[


]---
[  529.476660] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[  529.555983] ------------[ cut here ]------------
[  529.609128] BUG: failure at managers/processor/pcache/rmap.c:735/pcache_zap_pte()!
[  529.699613] Kernel Panic - not syncing: BUG!
[  529.750576] CPU: 5 PID: 32 Comm: word_count-pthr 4.0.0-lego-ys+ #361
[  529.826500] Stack:
[  529.850422] ffff88107e1a3dd8 ffffffff810259b4 0000000000000008 ffff88107e1a3de8
[  529.937787] ffff88107e1a3da0 0000000021475542 0000000000000000 0000000000000000
[  530.025152] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  530.112517] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  530.199882] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  530.287247] Call Trace:
[  530.316370] <TSK>
[  530.339251] [<ffffffff810259c0>] panic+0xc2/0xeb
[  530.394374] [<ffffffff8106190a>] ? client_internal_poll_sendcq+0x2a/0x80
[  530.474458] [<ffffffff8101bfcc>] ? task_tick_rt+0x2c/0xd0
[  530.538943] [<ffffffff81019725>] ? scheduler_tick+0x55/0x60
[  530.605506] [<ffffffff81016df5>] ? tick_handle_periodic+0x45/0x70
[  530.678311] [<ffffffff8103768a>] ? put_dec+0x1a/0x80
[  530.737595] [<ffffffff810066f4>] ? apic_timer_interrupt+0x54/0x90
[  530.810398] [<ffffffff8100e4aa>] ? smp__apic_timer_interrupt+0x6a/0x70
[  530.888403] [<ffffffff81012ccd>] ? printk+0x11d/0x1b0
[  530.948726] [<ffffffff81030429>] pcache_zap_pte+0xf9/0x160
[  531.014250] [<ffffffff8102f090>] ? __pcache_move_pte_fastpath+0x50/0x50
[  531.093295] [<ffffffff8102c8dc>] unmap_page_range+0x32c/0x3b0
[  531.161940] [<ffffffff8102c97e>] release_pgtable+0x1e/0x40
[  531.227463] [<ffffffff8102bfb3>] sys_munmap+0xc3/0x120
[  531.288827] [<ffffffff8100e86d>] do_syscall_64+0x3d/0xc0
[  531.352270] [<ffffffff8100d76c>] entry_SYSCALL64_slow_path+0x25/0x25

7th run, with debug info added; it does not seem that useful:

]---
[15755.579501] SYSC_close(): [4] -> [/sys/devices/system/cpu/online]
[15755.672760] pte:ffff88107e1a3dd8 pfn:0x8207e80b flags:(dirty|large|global|softw4|pkey0|pkey1|pkey2|pkey3|nx|0x3ff800000000000)
[15755.807015] pte dumped because: Invalid pte
[15755.856932] address: 0x7ffefc638000
[15755.899569] ------------[ cut here ]------------
[15755.954684] BUG: failure at managers/processor/pcache/rmap.c:747/pcache_zap_pte()!
[15756.045159] Kernel Panic - not syncing: BUG!
[15756.096114] CPU: 5 PID: 32 Comm: word_count-pt

Tried several times, even with the mmap/munmap debug option on; it crashed at the same point. The key is address 0x7ffefc638000 and the mmap() related to it.

Close to finding the bug. Latest log is in 0310-18.


03/09 Fri

Found a bug in kmalloc

Tried printing the pud in every syscall to catch the culprit:

wuklab13 0309-1
[  320.088684] CPU5 PID32 sys_close+0x0/0x1f0
[  320.137567] do_syscall_64(): enter pgd ffff88207fccf000, pgd.cont_va ffff88207fc6f000, pud_index=0x0 pud: ffff88207fc6f000
[  320.269657] SYSC_close() cpu(5) tsk(32/32/python) user-ip:0x7ffff7df3c37
[  320.349742]     3
[  320.372624] SYSC_close(): [3] -> [/lib64/libpython2.7.so.1.0]
[  320.441268] SYSC_close() cpu(5) tsk(32/32/python) ret: 0x0 (0)
[  320.510954] do_syscall_64(): leave pgd ffff88207fccf000, pgd.cont_va ffff88207fc6f000, pud_index=0x0 pud: ffff88207fc6f000

[  320.643043]     addr: 0x7ffff7a101f0, pgd: ffff88207fccf7f8
[  320.709607]     addr: 0x7ffff7a101f0, pgd: ffff88207fccf7f8 pud ffff88207fcaeff8
[  320.798014] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a101f0 flags:0x50
[  320.995755] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a101f0 flags:0x50 ret:0(OKAY)

[  321.101944]     addr: 0x7ffff7a21749, pgd: ffff88207fccf7f8
[  321.168509]     addr: 0x7ffff7a21749, pgd: ffff88207fccf7f8 pud ffff88207fcaeff8
[  321.256914] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a21749 flags:0x50
[  321.454651] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a21749 flags:0x50 ret:0(OKAY)

[  321.560845]     addr: 0x7ffff7ff2fda, pgd: ffff88207fccf7f8
[  321.627409]     addr: 0x7ffff7ff2fda, pgd: ffff88207fccf7f8 pud ffff88207fcaeff8
[  321.715815] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7ff2fda flags:0x50
[  321.913553] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7ff2fda flags:0x50 ret:0(OKAY)

[  322.019745] CPU5 PID32 sys_open+0x0/0x10
[  322.066548] do_syscall_64(): enter pgd ffff88207fccf000, pgd.cont_va ffff9001801ff000, pud_index=0x0 pud: ffff9001801ff000
[  322.198638] SYSC_open() cpu(5) tsk(32/32/python) user-ip:0x7ffff7df3b27
[  322.277683]     f_name: /lib64/libpthread.so.0, flags: 80000, mode: e150
[  322.357780] SYSC_open() cpu(5) tsk(32/32/python) ret: 0x3 (3)
[  322.426414] do_syscall_64(): leave pgd ffff88207fccf000, pgd.cont_va ffff9001801ff000, pud_index=0x0 pud: ffff9001801ff000

After printing more in pcache_handle_fault, I found who corrupted the pgtable:

wuklab13 0309-5
[  661.308584] CPU5 PID32 sys_close+0x0/0x1f0
[  661.357466] do_syscall_64(): enter pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  661.489557] SYSC_close() cpu(5) tsk(32/32/python) user-ip:0x7ffff7df3c37
[  661.569642]     3    
[  661.592525] SYSC_close(): [3] -> [/lib64/libpython2.7.so.1.0]
[  661.661170] SYSC_close() cpu(5) tsk(32/32/python) ret: 0x0 (0)
[  661.730854] do_syscall_64(): leave pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  661.862944] pcache_handle_fault(): enter pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  662.001275]     addr: 0x7ffff7a101f0, pgd: ffff88207fccf7f8
[  662.067840]     addr: 0x7ffff7a101f0, pgd: ffff88207fccf7f8 pud ffff88207fcafff8
[  662.156247] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a101f0 flags:0x50
[  662.353985] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a101f0 flags:0x50 ret:0(OKAY)
[  662.460176] pcache_handle_fault(): leave pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000

[  662.600586] pcache_handle_fault(): enter pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  662.738916]     addr: 0x7ffff7a21749, pgd: ffff88207fccf7f8
[  662.805481]     addr: 0x7ffff7a21749, pgd: ffff88207fccf7f8 pud ffff88207fcafff8
[  662.893888] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a21749 flags:0x50
[  663.091636] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a21749 flags:0x50 ret:0(OKAY)
[  663.197831] pcache_handle_fault(): leave pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000

[  663.338242] pcache_handle_fault(): enter pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  663.476572]     addr: 0x7ffff7ff2fda, pgd: ffff88207fccf7f8
[  663.543135]     addr: 0x7ffff7ff2fda, pgd: ffff88207fccf7f8 pud ffff88207fcafff8
[  663.631543] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7ff2fda flags:0x50
[  663.829279] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7ff2fda flags:0x50 ret:0(OKAY)
[  663.935472] pcache_handle_fault(): leave pgd ffff88207fccf000, pgd.cont_va ffff9001801ff000, pud_index=0x0 pud: ffff9001801ff000

[  664.075884] CPU5 PID32 sys_open+0x0/0x10
[  664.122686] do_syscall_64(): enter pgd ffff88207fccf000, pgd.cont_va ffff9001801ff000, pud_index=0x0 pud: ffff9001801ff000
[  664.254776] SYSC_open() cpu(5) tsk(32/32/python) user-ip:0x7ffff7df3b27
[  664.333821]     f_name: /lib64/libpthread.so.0, flags: 80000, mode: e150
[  664.413918] SYSC_open() cpu(5) tsk(32/32/python) ret: 0x3 (3)
[  664.482552] do_syscall_64(): leave pgd ffff88207fccf000, pgd.cont_va ffff9001801ff000, pud_index=0x0 pud: ffff9001801ff000

Then, try to catch the bug at the fault on address 0x7ffff7ff2fda. Printing is still the most effective way to debug. :-)

Digging further, I found the pgtable corrupted after pcache_add_rmap(), specifically after alloc_pcache_rmap():

[ 5024.482570] pcache_add_rmap() 343 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[ 5024.613601] alloc_pcache_rmap(): size: 56, rmap: ffff88207fccefd0
[ 5024.686396] pcache_add_rmap() 358 pgd ffff88207fccf000, pgd.cont_va ffff90207fcce000, pud_index=0x0 pud: ffff90207fcce000

Well, rmap is ffff88207fccefd0 and the corrupted pud value is ffff90207fcce000; clearly the two are related:

[  843.916517] pcache_add_rmap() 372 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  844.047557] alloc_pcache_rmap() 60 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  844.179638] alloc_pcache_rmap(): size: 56, rmap: ffff88207fccefd0
[  844.252438] alloc_pcache_rmap() 71 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  844.384517] alloc_pcache_rmap(): size: 56, rmap: ffff88207fccefd0
[  844.457317] alloc_pcache_rmap() 85 pgd ffff88207fccf000, pgd.cont_va ffff90207fcce000, pud_index=0x0 pud: ffff90207fcce000
[  844.589398] pcache_add_rmap() 387 pgd ffff88207fccf000, pgd.cont_va ffff90207fcce000, pud_index=0x0 pud: ffff90207fcce000

46 static struct pcache_rmap *alloc_pcache_rmap(void)
47 {
48         struct pcache_rmap *rmap;
49
50         pgd_t *pgd;
51         pud_t *pud;
52         unsigned long addr;
53         struct mm_struct *mm = current->mm;
54
55         if (pall) {
56                 addr = 0x601008;
57                 pgd = pgd_offset(mm, addr);
58                 pud = pud_alloc(mm, pgd, addr);
59                 pr_info("%s() %d pgd %p, pgd.cont_va %lx, pud_index=%#lx pud: %p\n",
60                         __func__, __LINE__, pgd, pgd_page_vaddr(*pgd), pud_index(addr), (void *)pud);
61         }
62
63         rmap = kmalloc(sizeof(*rmap), GFP_KERNEL);
64
65         if (pall) {
66                 addr = 0x601008;
67                 pgd = pgd_offset(mm, addr);
68                 pud = pud_alloc(mm, pgd, addr);
69                 pr_info("%s(): size: %zu, rmap: %p\n", __func__, sizeof(*rmap), rmap);
70                 pr_info("%s() %d pgd %p, pgd.cont_va %lx, pud_index=%#lx pud: %p\n",
71                         __func__, __LINE__, pgd, pgd_page_vaddr(*pgd), pud_index(addr), (void *)pud);
72         }
73
74         if (rmap) {
75                 INIT_LIST_HEAD(&rmap->next);
76                 rmap->flags = 0;
77         }
78
79         if (pall) {
80                 addr = 0x601008;
81                 pgd = pgd_offset(mm, addr);
82                 pud = pud_alloc(mm, pgd, addr);
83                 pr_info("%s(): size: %zu, rmap: %p\n", __func__, sizeof(*rmap), rmap);
84                 pr_info("%s() %d pgd %p, pgd.cont_va %lx, pud_index=%#lx pud: %p\n",
85                         __func__, __LINE__, pgd, pgd_page_vaddr(*pgd), pud_index(addr), (void *)pud);
86         }
87
88         return rmap;
89 }

Narrowed it down to INIT_LIST_HEAD:

[ 1334.548682] alloc_pcache_rmap(): size: 56, rmap: ffff88207fccefd0
[ 1334.621487] alloc_pcache_rmap() 71 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[ 1334.753576] alloc_pcache_rmap() 76 &rmap->next ffff88207fcceff8 &flags ffff88207fccefd8
[ 1334.922067] alloc_pcache_rmap() 86 pgd ffff88207fccf000, pgd.cont_va ffff90207fcce000, pud_index=0x0 pud: ffff90207fcce000
[ 1335.126962] alloc_pcache_rmap() 98 pgd ffff88207fccf000, pgd.cont_va ffff90207fcce000, pud_index=0x0 pud: ffff90207fcce000

74         if (rmap) {
75         pr_info("%s() %d &rmap->next %p &flags %p\n",
76                 __func__, __LINE__, &rmap->next, &rmap->flags);
77
78                 INIT_LIST_HEAD(&rmap->next);
79
80         if (pall) {
81                 addr = 0x601008;
82                 pgd = pgd_offset(mm, addr);
83                 pud = pud_alloc(mm, pgd, addr);
84                 pr_info("%s(): size: %zu, rmap: %p\n", __func__, sizeof(*rmap), rmap);
85                 pr_info("%s() %d pgd %p, pgd.cont_va %lx, pud_index=%#lx pud: %p\n",
86                         __func__, __LINE__, pgd, pgd_page_vaddr(*pgd), pud_index(addr), (void *)pud);
87         }
88
89                 rmap->flags = 0;
90         }

Seriously, if this were running at user level on a VM, I would be able to find the bug in maybe 30 minutes. But it took me several hours on a physical machine. Damn you, physical machine.

Hmm, this function is used A LOT. How can it fail at this point? Possible reasons:

  • the kmalloc’ed area happens to intersect with the pgtable?
  • one physical page is mapped twice? Once into the pgtable, once via this rmap.
  • the tty/serial code has a bug? Really ancient code.

After adding a few printk’s, IB seems stuck. And this happens with just a few more lines of code! Why? Does code size matter?

[  722.381469] pcache_handle_fault(): enter pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[  722.519778]     addr: 0x7ffff7feffcc, pgd: ffff88207fccf7f8
[  722.586334]     addr: 0x7ffff7feffcc, pgd: ffff88207fccf7f8 pud ffff88207fcafff8
[  722.674727] Before fill address=0x7ffff7feffcc set_idx:0x7fef
[  722.743362] pcache:ffff8801801ffbc0 mapcount:0 refcount:1 flags:(allocated|usable) set_idx=0x7fef kva: ffff880107fef000
[  722.872312] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7feffcc flags:0x50
[  722.967985] __pcache_do_fill_page(): before net pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
(last line of the log)

Well, the following finding finally reveals the buggy line, and it also sort of explains the hang above: the kmalloc’ed area has issues, so IB is touching the wrong data. The bug is in kmalloc: the rmap is 56 bytes and should fit within a single page, but it does not:

[ 1862.307427] pcache_add_rmap() 413 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[ 1862.438477] alloc_pcache_rmap() 86 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[ 1862.570568] sp->units: 50 SLOB_UNITS: 32
[ 1862.617372] alloc_pcache_rmap(): size: 56, rmap: ffff88207fccefd0
[ 1862.690178] alloc_pcache_rmap() 97 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[ 1862.822268] alloc_pcache_rmap() 104 &rmap->next ffff88207fcceff8 &flags ffff88207fccefd8
[ 1862.918995] __INIT_LIST_HEAD(): next ffff88207fcceff8 prev ffff88207fccf000
[ 1863.002202] __INIT_LIST_HEAD() 63 pgd ffff88207fccf000, pgd.cont_va ffff88207fcae000, pud_index=0x0 pud: ffff88207fcae000
[ 1863.133253] __INIT_LIST_HEAD(): next ffff88207fcceff8 prev ffff88207fccf000
[ 1863.216459] alloc_pcache_rmap(): size: 56, rmap: ffff88207fccefd0
[ 1863.289265] alloc_pcache_rmap() 114 pgd ffff88207fccf000, pgd.cont_va ffff90207fcce000, pud_index=0x0 pud: ffff90207fcce000

Analysis: the @prev field in line 7 has address ffff88207fccf000, which happens to be the pgd page (pgd ffff88207fccf000). Thus when we do list->prev = list, it writes to the first 8 bytes of the pgd page and corrupts the original pgd entry. That is why we see a corrupted pgd entry (ffff90207fcce000).

This is rooted in kmalloc, which should never hand out an object that crosses two pages.


03/08 Thur

Took several days off. This morning I finished porting wait4 and waitid, which actually involved a lot of code change. The concept and mechanism are fairly simple, but legacy UNIX tradition makes the implementation quite complex.

Now, back to finish debugging the pcache issue. It must be fixed this week.

python

Tried python hello_world.py; the program runs for a while and crashes at a deterministic point:

wuklab13 and wuklab15, ~/ttyS1
[419097.929969] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a4b008 flags:0x50 ret:0(OKAY)
[419098.039145] __pcache_do_fill_page(): I pid:32 tgid:32 address:0x7ffff7a4c010 flags:0x50
[419098.306537] __pcache_do_fill_page(): O pid:32 tgid:32 address:0x7ffff7a4c010 flags:0x50 ret:0(OKAY)
[419098.413756] CPU5 PID32 sys_mprotect+0x0/0x90
[419098.465753] SYSC_mprotect() cpu(5) tsk(32/32/python) user-ip:0x7ffff7df3d27
[419098.549990]     start:0x7ffff7d8c000,len:0x2000,prot:0x1
[419098.614469] BUG: unable to handle kernel paging request at ffff9001801ff000
[419098.698703] IP: [<ffffffff8102f7a9>] pcache_handle_fault+0x69/0x6c0
[419098.774621] PGD 0
[419098.799579] Oops: 0000 [#1] SMP PROCESSOR
[419098.848457] CPU: 5 PID: 32 Comm: python 4.0.0-lego-ys+ #312
[419098.916054] RIP: 0010:[<ffffffff8102f7a9>]  [<ffffffff8102f7a9>] pcache_handle_fault+0x69/0x6c0
[419099.021089] RSP: 0000:ffff88107e857ed8  EFLAGS: 00010286
[419099.085567] RAX: ffff9001801ff000 RBX: ffff9001801ff000 RCX: 00003ffffffff000
[419099.171884] RDX: 00000801801ff000 RSI: 0000000000601008 RDI: ffff88107e83d648
[419099.258199] RBP: ffff88107e857f18 R08: 00007ffff7fe3000 R09: 00007ffff7fe3000
[419099.344516] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000601008
[419099.430832] R13: ffff88107e83d648 R14: 0000000000000050 R15: 00007ffff7ffe150
[419099.517149] FS:  00007ffff7fdf740(0000) GS:ffff88207fc40000(0000) knlGS:0000000000000000
[419099.614905] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[419099.684582] CR2: ffff9001801ff000 CR3: 000000207fccf000 CR4: 00000000000406a0
[419099.770899] Stack:
[419099.795858] 00007ffff7d8c000 0000000000002000 0000000000000001 0000000000000004
[419099.884254] 0000000000601008 ffff88107e857f58 0000000000000000 00007ffff7ffe150
[419099.972650] ffff88107e857f48 ffffffff81010082 0000000000000000 0000000000000001
[419100.061047] 000392c29c720ba2 0000000000000000 00007fffffffdc40 ffffffff8100d91f
[419100.149442] 00007ffff7ffe150 0000000000000000 000392c29c720ba2 0000000000000001
[419100.237839] Call Trace:
[419100.267998] <TSK>
[419100.291917] [<ffffffff81010082>] do_page_fault+0xa2/0x1a0
[419100.357434] [<ffffffff8100d91f>] page_fault+0x1f/0x30
[419100.418792] <EOT>


M:
...
[419142.163396] handle_p2m_pcache_miss() cpu 4 I nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7a4c010
[419142.268460] handle_p2m_pcache_miss() cpu 4 O nid:0 pid:32 tgid:32 flags:50 vaddr:0x7ffff7a4c010
(Last Message)

Digging deeper:

int pcache_handle_fault(struct mm_struct *mm,
                        unsigned long address, unsigned long flags)
{
        ..
        pgd = pgd_offset(mm, address);
        pr_info("    addr: %#lx, pgd: %p\n", address, pgd);
        pud = pud_alloc(mm, pgd, address);
        pr_info("    addr: %#lx, pgd: %p pud %p\n", address, pgd, pud);
        if (!pud)
                return VM_FAULT_OOM;
        pmd = pmd_alloc(mm, pud, address);
        if (!pmd)
..
}

[21130.503314] strace__mprotect cpu5 start=0x7ffff7d8c000, len=0x2000, prot(0x1)=PROT_READ
[21130.598994] SYSC_mprotect() cpu(5) tsk(32/32/python) user-ip:0x7ffff7df3d27
[21130.682193]     start:0x7ffff7d8c000,len:0x2000,prot:0x1
[21130.745635]     addr: 0x601008, pgd: ffff88207fccf000
[21130.805954]     addr: 0x601008, pgd: ffff88207fccf000 pud ffff9001801ff000
[21130.888116] BUG: unable to handle kernel paging request at ffff9001801ff000
[21130.971314] IP: [<ffffffff8102fa11>] pcache_handle_fault+0x91/0x6f0

Printing pgd and pud info; these three messages are related, and the last one leads to the panic:

wuklab13 ~/ys/0308-6
[  479.375498] addr: 0x400040, pgd: ffff88207fccf000
[  479.435819] pud_alloc_one(): addr: 0x400040, pud: ffff88207fc6f000
[  479.511739] pud_alloc(): addr: 0x400040 pgd ffff88207fccf000, pgd.cont_va ffff88207fc6f000, pud_index=0x0 pud: ffff88207fc6f000
[  479.649021] addr: 0x400040, pgd: ffff88207fccf000 pud ffff88207fc6f000

[  480.016381] addr: 0x600dd8, pgd: ffff88207fccf000
[  480.076701] pud_alloc(): addr: 0x600dd8 pgd ffff88207fccf000, pgd.cont_va ffff88207fc6f000, pud_index=0x0 pud: ffff88207fc6f000
[  480.213982] addr: 0x600dd8, pgd: ffff88207fccf000 pud ffff88207fc6f000

[  680.072819] addr: 0x601008, pgd: ffff88207fccf000
[  680.133138] pud_alloc(): addr: 0x601008 pgd ffff88207fccf000, pgd.cont_va ffff90107e834000, pud_index=0x0 pud: ffff90107e834000
[  680.270422] addr: 0x601008, pgd: ffff88207fccf000 pud ffff90107e834000

[  680.352583] BUG: unable to handle kernel paging request at ffff90107e834000
[  680.435783] IP: [<ffffffff8102fc43>] pcache_handle_fault+0xb3/0x770
[  680.510664] PGD 0

I need to check what happens between 480s and 680s. Something in between corrupted the pgtable. I suspect it could be:

  • copy_to_user-related syscalls
  • pcache establishing mappings, memcpy
  • all other memcpy/strcpy-style code

03/02 Fri

TODO:

  • -add vsyscall-
  • -pcache_exit_process: free rmap, free cacheline, etc. When rmap is NULL, we clearly should free this pcache.-
  • pcache_exit_thread? I don’t think we need this. All pcache related activities should relate to mm, or thread group leader, not one particular thread.
  • check python bug
  • use omnigraffle to draw the whole workflow of pcache.

Phoenix, word_count-seq, 4G dataset, 4GB pcache:

[  273.268853] Processor: Processor manager is running.
[  573.272479] page:ffffea0071bb9660 count:0 mapcount:-128
[  573.332903] flags: 0x200000000000300(slab|slob_free)
[  573.392182] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[  573.474340] ------------[ cut here ]------------
[  573.529459] BUG: failure at ./include/lego/mm.h:251/put_page_testzero()!
[  573.609537] Kernel Panic - not syncing: BUG!
[  573.660496] CPU: 4 PID: 13 Comm: kvictim_flushd 4.0.0-lego+ #18
[  573.731212] Stack:
[  573.755132] ffff88207e4bfe10 ffffffff81023644 0000000000000008 ffff88207e4bfe20
[  573.842490] ffff88207e4bfdd8 0000000021475542 0000000000000000 0000000000000000
[  573.929848] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  574.017205] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  574.104563] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  574.191921] Call Trace:
[  574.221039] <TSK>
[  574.243919] [<ffffffff81023650>] panic+0xc2/0xeb
[  574.299038] [<ffffffff8105a35a>] ? client_internal_poll_sendcq+0x2a/0x80
[  574.379115] [<ffffffff8105a4fd>] ? client_send_message_with_rdma_write_with_imm_request+0x14d/0x360
[  574.487273] [<ffffffff8101ac3c>] ? task_tick_rt+0x2c/0xd0
[  574.551751] [<ffffffff81018395>] ? scheduler_tick+0x55/0x60
[  574.618308] [<ffffffff81015a45>] ? tick_handle_periodic+0x45/0x70
[  574.691107] [<ffffffff810064c4>] ? apic_timer_interrupt+0x54/0x90
[  574.763905] [<ffffffff8100dbaa>] ? smp__apic_timer_interrupt+0x6a/0x70
[  574.841903] [<ffffffff8101198d>] ? printk+0x11d/0x1b0
[  574.902222] [<ffffffff81025c00>] __free_pages+0x2e0/0x3c0
[  574.966699] [<ffffffff81028472>] kfree+0x62/0x480
[  575.022858] [<ffffffff8102e6be>] victim_flush_func+0x15e/0x1e0
[  575.092536] [<ffffffff8102e560>] ? victim_try_fill_pcache+0x390/0x390
[  575.169494] [<ffffffff8101e446>] kthread+0xf6/0x120
[  575.227733] [<ffffffff8101e350>] ? __kthread_parkme+0x70/0x70
[  575.296371] [<ffffffff8100de32>] ret_from_fork+0x22/0x30
[  575.359810] <EOT>


03/01 Thur

Weird.

[43181.388400] p2m_fork(cpu5): I cur:24-word_count-seq new:25
[43181.435341] p2m_fork(cpu5): O succeed cur:24-word_count-seq new:25
[43181.436013] __pcache_do_fill_page(): I pid:24 tgid:24 address:0x4158d0 flags:0x150
[43181.439246] __pcache_do_fill_page(): O pid:24 tgid:24 address:0x4158d0 flags:0x150 ret:0(OKAY) csum:0x9e8f028e

[43181.510534] __pcache_do_fill_page(): I pid:25 tgid:25 address:0x415000 flags:0x150
[43181.517729] __pcache_do_fill_page(): O pid:25 tgid:25 address:0x415000 flags:0x150 ret:0(OKAY) csum:0xffff88029e8f028e

After all, it is a TLB issue. I forgot to flush the TLB after making the original pte read-only during fork. So the parent is still able to read/write some pages through stale TLB entries, even though those pages should be process-private.

Lego’s current TLB flush is very naive: we flush after each pte change. This performs worse than Linux’s batched flush.

Today’s case is flushing the TLB after making a pte read-only. And this really has to be performed one by one.