Ignite instances frequently failing - BUG: soft lockup - CPU#1 stuck

bbellrose

Ignite instances keep failing. The server reports a stuck CPU, but monitoring
shows very little CPU usage. This happens almost every day, on different
nodes of the cluster.

<http://apache-ignite-users.70518.x6.nabble.com/file/t3004/cpu.jpg>

Oct 29 17:48:19 ...
 kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 38s! [C2 CompilerThre:4000759]
Oct 29 17:48:19 nalrcsvridbq02 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 38s! [C2 CompilerThre:4000759]
Oct 29 17:48:19 nalrcsvridbq02 kernel: Modules linked in: binfmt_misc nf_tables nfnetlink vmw_vsock_vmci_transport vsock ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common nfit libnvdimm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_balloon intel_rapl_perf pcspkr joydev i2c_piix4 vmw_vmci auth_rpcgss sunrpc ip_tables xfs libcrc32c sr_mod cdrom vmwgfx ata_generic drm_kms_helper sd_mod sg syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crc32c_intel ahci drm libahci ata_piix serio_raw libata vmxnet3 vmw_pvscsi dm_mirror dm_region_hash dm_log dm_mod fuse
Oct 29 17:48:19 nalrcsvridbq02 kernel: CPU: 1 PID: 4000759 Comm: C2 CompilerThre Kdump: loaded Tainted: G             L   --------- -  - 4.18.0-193.19.1.el8_2.x86_64 #1
Oct 29 17:48:19 nalrcsvridbq02 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/03/2018
Oct 29 17:48:19 nalrcsvridbq02 kernel: RIP: 0033:0x7f5ba3875c65
Oct 29 17:48:19 nalrcsvridbq02 kernel: Code: 70 ff ff ff 4c 8b 75 80 eb 11 0f 1f 00 41 83 ed 01 41 83 fd ff 0f 84 8a 00 00 00 49 8b 54 24 10 44 89 e8 48 8b 1c c2 48 8b 03 <48> 89 df ff 50 10 84 c0 74 d9 48 85 db 0f 84 f0 00 00 00 8b 4b 28
Oct 29 17:48:19 nalrcsvridbq02 kernel: RSP: 002b:00007f5b8c380ca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Oct 29 17:48:19 nalrcsvridbq02 kernel: RAX: 00007f5ba4300840 RBX: 00007f5b58c86840 RCX: 00007f5b59a29200
Oct 29 17:48:19 nalrcsvridbq02 kernel: RDX: 00007f5b58c86998 RSI: 0000000000002080 RDI: 00007f5b4cc665c0
Oct 29 17:48:19 nalrcsvridbq02 kernel: RBP: 00007f5b8c380d30 R08: 0000000000000000 R09: 00007f5b4c4fb8e0
Oct 29 17:48:19 nalrcsvridbq02 kernel: R10: 0000000000000008 R11: 0000000000008d0e R12: 00007f5b58c86140
Oct 29 17:48:19 nalrcsvridbq02 kernel: R13: 0000000000000007 R14: 00007f5b8c382730 R15: 00007f5b8c380dc0
Oct 29 17:48:19 nalrcsvridbq02 kernel: FS:  00007f5b8c385700 GS:  0000000000000000
Oct 29 17:48:19 nalrcsvridbq02 agent[3978030]: 2020-10-29 17:48:19 EDT | CORE | ERROR | (pkg/forwarder/worker.go:178 in process) | Error while processing transaction: error while sending transaction, rescheduling it: Post "https://7-23-0-app.agent.datadoghq.com/api/v1/series?api_key=*************************44602": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Oct 29 17:48:19 nalrcsvridbq02 Ignite[4000681]: [17:48:19] Ignite node stopped OK [name=RailConnect Ignite QA Grid, uptime=08:52:18.074]
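
Since this happens almost daily on different nodes, it may help to pull the lockup details out of each node's syslog and compare them. The following is a minimal Python sketch, assuming the watchdog message always has the shape shown in the log above; the names `parse_soft_lockup` and `SoftLockup` are hypothetical helpers, not part of Ignite or any kernel tooling.

```python
import re
from typing import NamedTuple, Optional

# Matches the kernel watchdog message format seen in the log above, e.g.
# "BUG: soft lockup - CPU#1 stuck for 38s! [C2 CompilerThre:4000759]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    cpu: int      # CPU number reported by the watchdog
    seconds: int  # how long the CPU was reported stuck
    comm: str     # task name (the kernel truncates it, e.g. "C2 CompilerThre")
    pid: int      # ID of the task that was on the CPU


def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the lockup details from one syslog line, or None if absent."""
    m = LOCKUP_RE.search(line)
    if m is None:
        return None
    return SoftLockup(
        cpu=int(m.group("cpu")),
        seconds=int(m.group("secs")),
        comm=m.group("comm"),
        pid=int(m.group("pid")),
    )
```

Running it over every line of a node's log and keeping the non-`None` results shows whether the same task name and CPU recur across incidents.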


--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

aealexsandrov

Re: Ignite instances frequently failing - BUG: soft lockup - CPU#1 stuck

Hello,

There is not enough information here to diagnose the problem:

1) Could you provide a screenshot from the web console taken at the time of
the failure?
2) Could you collect the Ignite logs for this period?
3) Which tool reports that the CPUs are stuck? Have you checked with
other tools?

BR,
Andrew

On 10/30/2020 3:07 PM, bbellrose wrote:

> Ignite instances keep failing. Server indicates CPU stuck. However monitoring
> shows very little CPU usage. This happens almost every day on different
> nodes of the cluster.
>
> <http://apache-ignite-users.70518.x6.nabble.com/file/t3004/cpu.jpg>

bbellrose

Re: Ignite instances frequently failing - BUG: soft lockup - CPU#1 stuck

It looks like this was a CentOS 8 bug with ksmtuned. A few VMs had that
process consuming excessive CPU. I have disabled the service, and CPU usage
on the VM cluster is down. I am going to wait and see whether that resolves it.
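
To check whether kernel samepage merging (KSM, which ksmtuned tunes) is still active on a VM, the counters under `/sys/kernel/mm/ksm` can be read directly. This is only a rough Python sketch, assuming the standard Linux KSM sysfs files are present; `read_ksm_counters` is a hypothetical helper name, and it simply skips any file it cannot read.

```python
from pathlib import Path
from typing import Dict

# Standard location of the KSM control and statistics files on Linux.
KSM_DIR = Path("/sys/kernel/mm/ksm")


def read_ksm_counters(ksm_dir: Path = KSM_DIR) -> Dict[str, int]:
    """Read a few KSM sysfs values; missing or unreadable files are skipped."""
    counters: Dict[str, int] = {}
    for name in ("run", "pages_shared", "pages_sharing", "full_scans"):
        try:
            counters[name] = int((ksm_dir / name).read_text().strip())
        except (OSError, ValueError):
            # File absent (KSM not built in) or content not an integer.
            continue
    return counters


if __name__ == "__main__":
    # "run" is 0 when KSM is stopped and 1 when it is actively merging pages.
    for key, value in read_ksm_counters().items():
        print(f"{key} = {value}")
```

A `run` value of 0 after disabling the service would suggest that page merging is no longer running on that VM.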

Brian


