Full GC happening during rebalancing

userx

Hi

I am running an Ignite data grid in persistent mode with the following
configuration:
20 clients
3 Ignite servers (S1, S2, S3), each started with Xmx 5G, on version 2.6.0
WAL and WAL archive set to the same path, so that WAL archiving is disabled
For TcpCommunicationSpi:
<property name="usePairedConnections" value="true"></property>
<property name="messageQueueLimit" value="1024"></property>

After 15-20 minutes of running (by which time all clients are connected), S2
goes down due to full GC. I have repeated this experiment 3 times, and every
time one of the three servers goes down. I am taking a heap dump, but it is
dead slow. Please look at the time around 10:15.

Here are the Ignite server logs for S2:
2019-01-03 10:14:49,961 INFO [exchange-worker-#122%ABCD2%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander
- Starting rebalancing [grp=ProductCache, mode=ASYNC,
fromNode=c861aadf-692c-439b-a27c-5c0c0bc25cc2, partitionsCount=246,
topology=AffinityTopologyVersion [topVer=21, minorTopVer=2],
rebalanceId=281]
2019-01-03 10:14:50,078 INFO [grid-nio-worker-tcp-comm-10-#99%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:59476]
2019-01-03 10:14:53,948 INFO [db-checkpoint-thread-#123%ABCD2%] {}
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager
- Checkpoint finished [cpId=e8b1e7a7-ab78-49d4-8dda-9752fc91d92d,
pages=343796, markPos=FileWALPointer [idx=49, fileOff=58182175, len=14930],
walSegmentsCleared=0, markDuration=883ms, pagesWrite=3802ms, fsync=5036ms,
total=9922ms]
2019-01-03 10:14:59,241 INFO [grid-nio-worker-tcp-comm-12-#101%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:59508]
2019-01-03 10:14:59,442 INFO [grid-nio-worker-tcp-comm-14-#103%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:59510]
2019-01-03 10:15:00,242 ERROR [grid-nio-worker-tcp-comm-12-#101%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Failed to
process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=12,
bytesRcvd=1184852821, bytesSent=2534, bytesRcvd0=42, bytesSent0=38,
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-12,
igniteInstanceName=ABCD2, finished=false, hashCode=1249811755,
interrupted=false, runner=grid-nio-worker-tcp-comm-12-#101%ABCD2%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl
[locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:59508,
createTime=1546510499232, closeTime=0, bytesSent=38, bytesRcvd=42,
bytesSent0=38, bytesRcvd0=42, sndSchedTime=1546510499232,
lastSndTime=1546510499232, lastRcvTime=1546510499232, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@70f888ac, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true]]] java.io.IOException:
Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1250)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2339)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2110)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:748)

2019-01-03 10:15:00,242 WARN [grid-nio-worker-tcp-comm-12-#101%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Closing NIO
session because of unhandled exception [cls=class
o.a.i.i.util.nio.GridNioException, msg=Connection reset by peer]
2019-01-03 10:15:03,906 INFO [grid-nio-worker-tcp-comm-16-#105%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/10.63.216.22:47100,
rmtAddr=/10.63.216.50:56876]
2019-01-03 10:15:04,109 INFO [grid-nio-worker-tcp-comm-18-#107%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/10.63.216.22:47100,
rmtAddr=/10.63.216.50:56878]
2019-01-03 10:15:04,908 ERROR [grid-nio-worker-tcp-comm-16-#105%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Failed to
process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=16,
bytesRcvd=1257163869, bytesSent=2484, bytesRcvd0=42, bytesSent0=38,
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-16,
igniteInstanceName=ABCD2, finished=false, hashCode=1355017133,
interrupted=false, runner=grid-nio-worker-tcp-comm-16-#105%ABCD2%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl
[locAddr=/10.63.216.22:47100, rmtAddr=/10.63.216.50:56876,
createTime=1546510503900, closeTime=0, bytesSent=38, bytesRcvd=42,
bytesSent0=38, bytesRcvd0=42, sndSchedTime=1546510503900,
lastSndTime=1546510503900, lastRcvTime=1546510503900, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@70f888ac, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true]]] java.io.IOException:
Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1250)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2339)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2110)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:748)

2019-01-03 10:15:04,908 WARN [grid-nio-worker-tcp-comm-16-#105%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Closing NIO
session because of unhandled exception [cls=class
o.a.i.i.util.nio.GridNioException, msg=Connection reset by peer]
2019-01-03 10:15:06,428 INFO [grid-nio-worker-tcp-comm-0-#89%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:59518]
2019-01-03 10:15:06,630 INFO [grid-nio-worker-tcp-comm-2-#91%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Accepted
incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:59522]
2019-01-03 10:15:07,429 ERROR [grid-nio-worker-tcp-comm-0-#89%ABCD2%] {}
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Failed to
process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=949538335, bytesSent=2282, bytesRcvd0=927948, bytesSent0=38,
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-0,
igniteInstanceName=ABCD2, ... (log excerpt truncated)


GC Logs for S2 (excerpts)
2019-01-03T10:14:58.873+0000: 1544.657: [GC pause (G1 Evacuation Pause)
(mixed), 0.0352468 secs]
   [Parallel Time: 29.1 ms, GC Workers: 28]
      [GC Worker Start (ms): Min: 1544657.8, Avg: 1544658.3, Max: 1544658.8,
Diff: 1.0]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.4, Max: 1.2, Diff: 1.2, Sum:
12.5]
      [Update RS (ms): Min: 7.9, Avg: 8.4, Max: 11.3, Diff: 3.4, Sum: 235.8]
         [Processed Buffers: Min: 1, Avg: 10.8, Max: 20, Diff: 19, Sum: 303]
      [Scan RS (ms): Min: 0.0, Avg: 1.3, Max: 1.8, Diff: 1.8, Sum: 36.8]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1,
Sum: 0.6]
      [Object Copy (ms): Min: 15.2, Avg: 16.8, Max: 18.1, Diff: 2.8, Sum:
471.2]
      [Termination (ms): Min: 0.0, Avg: 1.3, Max: 1.3, Diff: 1.3, Sum: 35.4]
         [Termination Attempts: Min: 1, Avg: 1.4, Max: 3, Diff: 2, Sum: 39]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum:
1.6]
      [GC Worker Total (ms): Min: 27.8, Avg: 28.4, Max: 28.8, Diff: 1.0,
Sum: 793.9]
      [GC Worker End (ms): Min: 1544686.6, Avg: 1544686.7, Max: 1544686.7,
Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.2 ms]
   [Other: 4.8 ms]
      [Choose CSet: 0.4 ms]
      [Ref Proc: 0.7 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.5 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 2.0 ms]
   [Eden: 224.0M(224.0M)->0.0B(2215.0M) Survivors: 32.0M->32.0M Heap:
1942.5M(5120.0M)->1609.5M(5120.0M)]
 [Times: user=0.79 sys=0.01, real=0.04 secs]
2019-01-03T10:15:10.173+0000: 1555.956: [GC pause (G1 Evacuation Pause)
(young) (to-space exhausted), 1.2771571 secs]
   [Parallel Time: 1070.4 ms, GC Workers: 28]
      [GC Worker Start (ms): Min: 1555957.8, Avg: 1555958.1, Max: 1555958.7,
Diff: 0.9]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.5, Max: 1.1, Diff: 1.1, Sum:
13.9]
      [Update RS (ms): Min: 3.1, Avg: 12.4, Max: 20.2, Diff: 17.1, Sum:
346.2]
         [Processed Buffers: Min: 8, Avg: 25.0, Max: 43, Diff: 35, Sum: 701]
      [Scan RS (ms): Min: 0.5, Avg: 1.7, Max: 7.9, Diff: 7.4, Sum: 48.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.2]
      [Object Copy (ms): Min: 1040.3, Avg: 1051.5, Max: 1062.3, Diff: 22.0,
Sum: 29441.9]
      [Termination (ms): Min: 0.0, Avg: 3.2, Max: 10.6, Diff: 10.6, Sum:
89.7]
         [Termination Attempts: Min: 1, Avg: 11.1, Max: 27, Diff: 26, Sum:
310]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum:
6.5]
      [GC Worker Total (ms): Min: 1068.8, Avg: 1069.5, Max: 1070.0, Diff:
1.2, Sum: 29946.8]
      [GC Worker End (ms): Min: 1557027.4, Avg: 1557027.6, Max: 1557027.8,
Diff: 0.4]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.5 ms]
   [Other: 205.2 ms]
      [Evacuation Failure: 197.5 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.2 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 1.4 ms]
      [Humongous Register: 0.3 ms]
      [Humongous Reclaim: 0.2 ms]
      [Free CSet: 3.2 ms]
         
****************************************
2019-01-03T10:22:00.265+0000: 1966.049: [GC pause (G1 Evacuation Pause)
(young) (initial-mark), 0.0210944 secs]
   [Parallel Time: 12.1 ms, GC Workers: 28]
      [GC Worker Start (ms): Min: 1966050.3, Avg: 1966050.8, Max: 1966051.4,
Diff: 1.1]
      [Ext Root Scanning (ms): Min: 1.3, Avg: 4.7, Max: 10.4, Diff: 9.1,
Sum: 132.3]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.7, Diff: 0.7, Sum: 2.5]
         [Processed Buffers: Min: 0, Avg: 1.2, Max: 5, Diff: 5, Sum: 34]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.2]
      [Termination (ms): Min: 0.0, Avg: 6.2, Max: 9.6, Diff: 9.6, Sum:
172.4]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 28]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum:
0.4]
      [GC Worker Total (ms): Min: 10.4, Avg: 11.0, Max: 11.6, Diff: 1.1,
Sum: 308.8]
      [GC Worker End (ms): Min: 1966061.8, Avg: 1966061.8, Max: 1966062.2,
Diff: 0.4]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.5 ms]
   [Other: 8.5 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 5.2 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.6 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.5 ms]
      [Free CSet: 0.2 ms]
   [Eden: 0.0B(256.0M)->0.0B(256.0M) Survivors: 0.0B->0.0B Heap:
5075.6M(5120.0M)->5075.6M(5120.0M)]
 [Times: user=0.22 sys=0.00, real=0.02 secs]
2019-01-03T10:22:00.287+0000: 1966.070: [GC
concurrent-root-region-scan-start]
2019-01-03T10:22:00.287+0000: 1966.070: [GC concurrent-root-region-scan-end,
0.0000520 secs]
2019-01-03T10:22:00.287+0000: 1966.070: [GC concurrent-mark-start]
2019-01-03T10:22:00.288+0000: 1966.072: [Full GC (Allocation Failure)

Can someone help me with this? All three times, the rebalancing had just
started:
2019-01-03 10:14:49,961 INFO [exchange-worker-#122%ABCD2%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander
- Starting rebalancing [grp=ProductCache, mode=ASYNC,
fromNode=c861aadf-692c-439b-a27c-5c0c0bc25cc2, partitionsCount=246,
topology=AffinityTopologyVersion [topVer=21, minorTopVer=2],
rebalanceId=281]






userx

Re: Full GC happening during rebalancing

Hi,

I have not been able to analyze the heap dump taken after the OOM (it does not
open in MAT or jhat), but I am repeating my experiment and just took a heap
dump while one of the Ignite servers is at 2 GB out of 5 GB.

This is what the MAT report tells me:

80 instances of
"org.apache.ignite.internal.util.StripedExecutor$StripeConcurrentQueue",
loaded by "sun.misc.Launcher$AppClassLoader @ 0x680600820" occupy
592,456,488 (62.80%) bytes.

For 3 Ignite servers, is this expected? Can someone tell me how to control the
number of stripes per node and the StripedExecutor?



ilya.kasnacheev

Re: Full GC happening during rebalancing

Hello!

You can decrease the number of stripes, but that will also hamper your
performance. Moreover, the queued messages may just be redistributed across the
remaining stripes.
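
If you still want to experiment with that, the stripe count is set on
IgniteConfiguration; a minimal sketch (the value below is purely illustrative):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StripedPoolSizeExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // The default is one stripe per available CPU core; fewer stripes means
        // fewer per-stripe queues, but also less message-processing parallelism.
        cfg.setStripedPoolSize(8); // illustrative value, not a recommendation

        Ignition.start(cfg);
    }
}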

Have you tried decreasing the rebalance throughput?
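
By rebalance throughput I mean the per-cache rebalance settings; a rough sketch
of the kind of knobs I have in mind for your ProductCache (the values are
examples to show the direction, not tuned recommendations):

import org.apache.ignite.configuration.CacheConfiguration;

public class RebalanceThrottlingExample {
    static CacheConfiguration<Object, Object> productCacheConfig() {
        CacheConfiguration<Object, Object> ccfg = new CacheConfiguration<>("ProductCache");

        // Smaller supply-message batches plus a pause between them slow
        // rebalancing down, so less rebalance data piles up in memory at once.
        ccfg.setRebalanceBatchSize(256 * 1024);   // bytes per batch; default is 512 KB
        ccfg.setRebalanceBatchesPrefetchCount(2); // batches requested ahead; default is 2
        ccfg.setRebalanceThrottle(100);           // pause in ms between batches; default is 0

        return ccfg;
    }
}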

Regards,
--
Ilya Kasnacheev

