Ignite Load Hangs at the end while using DataStreamer

classic Classic list List threaded Threaded
7 messages Options
pg31 pg31
Reply | Threaded
Open this post in threaded view
|

Ignite Load Hangs at the end while using DataStreamer

Hi

Cluster Configuration:
3 Nodes (112 GB Memory / 512 GB Disk)

Ignite Configuration:
1. Persistence Enabled
2. Version: 2.6.0
3. Configuration is as follows:
            <property name="authenticationEnabled" value="true"/>
            <property name="failureDetectionTimeout" value="30000"/>
            <property name="workDirectory" value="/persistence/work"/>            
            <property name="dataStorageConfiguration">
                <bean
class="org.apache.ignite.configuration.DataStorageConfiguration">                    
                    <property name="defaultDataRegionConfiguration">
                        <bean
class="org.apache.ignite.configuration.DataRegionConfiguration">                            
                            <property name="name" value="Default_Region"/>
                            <property name="persistenceEnabled"
value="true"/>
                            <property name="maxSize" value="#{80 * 1024 *
1024 * 1024}"/>
                        </bean>
                    </property>
                    <property name="storagePath" value="/persistence"/>
                    <property name="walPath" value="/wal"/>
                    <property name="walArchivePath" value="/wal/archive"/>
                    <property name="walMode" value="LOG_ONLY"/>
                    <property name="walCompactionEnabled" value="true" />
                    <property name="walHistorySize" value="2" />
                </bean>
            </property>


Issue: Data Loading with Spark gets stuck at the end.
Description:
I am trying to load 65M (million) rows into the Ignite cluster. Everything
runs well till 64.5 Million rows and then all of sudden, the data ingestion
just hangs. (I am able to ingest 64.5 Million rows in about 10 minutes)

There is enough memory which is still free on all the nodes (Approximately
80 GB of memory remains free on all the nodes).

I am using the following code to ingest data into ignite.

dataFrame
      .write.format(IgniteDataFrameSettings.FORMAT_IGNITE)
      .option(IgniteDataFrameSettings.OPTION_TABLE, tableName)
      .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, igniteConfig)
     
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS,
primaryKey)
      .option("user", igniteUsername)
      .option("password", ignitePassword)
      .mode(SaveMode.Overwrite) //Overwriting entire table.
      .save()

I am not sure, but it looks like that the Ignite Thread is hanging
somewhere.

This is what I can see in the thread dumps of all the executors.


<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Spark_Thread_Stuck.png>



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
dmagda dmagda
Reply | Threaded
Open this post in threaded view
|

Re: Ignite Load Hangs at the end while using DataStreamer

Please attach all the thread dumps and log files for the investigation.

-
Denis


On Wed, Feb 5, 2020 at 8:03 AM pg31 <[hidden email]> wrote:
Hi

Cluster Configuration:
3 Nodes (112 GB Memory / 512 GB Disk)

Ignite Configuration:
1. Persistence Enabled
2. Version: 2.6.0
3. Configuration is as follows:
            <property name="authenticationEnabled" value="true"/>
            <property name="failureDetectionTimeout" value="30000"/>
            <property name="workDirectory" value="/persistence/work"/>           
            <property name="dataStorageConfiguration">
                <bean
class="org.apache.ignite.configuration.DataStorageConfiguration">                   
                    <property name="defaultDataRegionConfiguration">
                        <bean
class="org.apache.ignite.configuration.DataRegionConfiguration">                           
                            <property name="name" value="Default_Region"/>
                            <property name="persistenceEnabled"
value="true"/>                                         
                            <property name="maxSize" value="#{80 * 1024 *
1024 * 1024}"/>
                        </bean>
                    </property>
                    <property name="storagePath" value="/persistence"/>
                    <property name="walPath" value="/wal"/>
                    <property name="walArchivePath" value="/wal/archive"/>
                    <property name="walMode" value="LOG_ONLY"/>
                    <property name="walCompactionEnabled" value="true" />
                    <property name="walHistorySize" value="2" />
                </bean>
            </property>


Issue: Data Loading with Spark gets stuck at the end.
Description:
I am trying to load 65M (million) rows into the Ignite cluster. Everything
runs well till 64.5 Million rows and then all of sudden, the data ingestion
just hangs. (I am able to ingest 64.5 Million rows in about 10 minutes)

There is enough memory which is still free on all the nodes (Approximately
80 GB of memory remains free on all the nodes).

I am using the following code to ingest data into ignite.

dataFrame
      .write.format(IgniteDataFrameSettings.FORMAT_IGNITE)
      .option(IgniteDataFrameSettings.OPTION_TABLE, tableName)
      .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, igniteConfig)

.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS,
primaryKey)
      .option("user", igniteUsername)
      .option("password", ignitePassword)
      .mode(SaveMode.Overwrite) //Overwriting entire table.
      .save()

I am not sure, but it looks like that the Ignite Thread is hanging
somewhere.

This is what I can see in the thread dumps of all the executors.


<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Spark_Thread_Stuck.png>



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
pg31 pg31
Reply | Threaded
Open this post in threaded view
|

Re: Ignite Load Hangs at the end while using DataStreamer

Hi Denis

Thank you for the response.

Please find the thread-dumps (Had to split it in 3 files since it was large)
1. T1.pdf <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T1.pdf>  
2. T2.pdf <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T2.pdf>  
3. T3.pdf <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T3.pdf>  


Please find the logs attached. (Zipped the file, since it was large)
Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Logs.zip>  



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
pg31 pg31
Reply | Threaded
Open this post in threaded view
|

Re: Ignite Load Hangs at the end while using DataStreamer

In reply to this post by pg31
Hi

I thought I was hitting: (https://issues.apache.org/jira/browse/IGNITE-8564)
which got fixed in 2.7.
I changed the version to 2.7.6 and retried, but it still does not work.

Here are the logs for 2.7.6
1. Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Logs.zip>  

Here is the thread-dump (Because of file constraint, had to break it up)
1.  T11.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T11.pdf>  
2.  T12.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T12.pdf>  
3.  T13.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T13.pdf>  

Please let me know if anything else is required.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Ignite Load Hangs at the end while using DataStreamer

Hello!

I think your problem as follows:

[15:09:53,418][WARNING][disco-event-worker-#37][GridDiscoveryManager] Local node's value of 'java.net.preferIPv4Stack' system property differs from remote node's (all nodes in topology should have identical value) [locPreferIpV4=true, rmtPreferIpV4=null, locId8=7e2a36bf, rmtId8=c311b665, rmtAddrs=[<a href="http://10.139.64.39/0:0:0:0:0:0:0:1%lo">10.139.64.39/0:0:0:0:0:0:0:1%lo, /10.139.64.39, /127.0.0.1], rmtNode=ClusterNode [id=c311b665-2392-4542-b98e-f599242c2946, order=20, addr=[0:0:0:0:0:0:0:1%lo, 10.139.64.39, 127.0.0.1], daemon=false]]

Caused by: java.lang.NullPointerException
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.isLocalNodeAddress(TcpCommunicationSpi.java:3498)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3243)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
... 17 more

I.e. some nodes are advertising IPv6 addresses when other nodes can't use them at all due to JVM setting. Please set this system property to 'true' for all your JVMs.

I also think this is related to https://issues.apache.org/jira/browse/IGNITE-8343 which is regrettably not fixed.

Regards,
--
Ilya Kasnacheev


чт, 6 февр. 2020 г. в 18:48, pg31 <[hidden email]>:
Hi

I thought I was hitting: (https://issues.apache.org/jira/browse/IGNITE-8564)
which got fixed in 2.7.
I changed the version to 2.7.6 and retried, but it still does not work.

Here are the logs for 2.7.6
1. Logs.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/Logs.zip

Here is the thread-dump (Because of file constraint, had to break it up)
1.  T11.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T11.pdf
2.  T12.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T12.pdf
3.  T13.pdf
<http://apache-ignite-users.70518.x6.nabble.com/file/t2770/T13.pdf

Please let me know if anything else is required.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
pg31 pg31
Reply | Threaded
Open this post in threaded view
|

Re: Ignite Load Hangs at the end while using DataStreamer

Thanks Ilya.

I have changed the Client Side Machine to prefer IPv4 Stack and hence that
error went away. But still the data-streamer-stripes and tcp-comm-worker
threads keep getting stuck.

I am attaching the logs again. (These contain the thread-dump themselves)
log.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/log.zip>  



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Ignite Load Hangs at the end while using DataStreamer

Hello!

I can see that some data processing is happening in thread dumps, but also this:

[11:16:11,637][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:38624]
[11:16:12,686][SEVERE][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi] Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2, bytesRcvd=430031923, bytesSent=2154539, bytesRcvd0=6974058, bytesSent0=1976, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false, heartbeatTs=1581074171663, hashCode=1764437028, interrupted=false, runner=grid-nio-worker-tcp-comm-2-#26]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=384, resendCnt=0, rcvCnt=422, sentCnt=413, reserved=true, lastAck=416, nodeLeft=false, node=TcpDiscoveryNode [id=a66a573a-43dc-48d2-8ee5-232e727acbc9, addrs=[10.139.64.10, 127.0.0.1], sockAddrs=[/10.139.64.10:0, /127.0.0.1:0], discPort=0, order=19, intOrder=19, lastExchangeTime=1581073961809, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=384, resendCnt=0, rcvCnt=422, sentCnt=413, reserved=true, lastAck=416, nodeLeft=false, node=TcpDiscoveryNode [id=a66a573a-43dc-48d2-8ee5-232e727acbc9, addrs=[10.139.64.10, 127.0.0.1], sockAddrs=[/10.139.64.10:0, /127.0.0.1:0], discPort=0, order=19, intOrder=19, lastExchangeTime=1581073961809, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:37846, createTime=1581073963095, closeTime=0, bytesSent=78611, bytesRcvd=104294928, bytesSent0=561, bytesRcvd0=916098, sndSchedTime=1581073963095, lastSndTime=1581074171592, lastRcvTime=1581074171612, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@672e22f0, directMode=true], GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:377)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)

[11:16:46,612][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601, locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9, rmtNodeOrder=19]
[11:16:46,928][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:38900]
[11:16:46,985][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601, locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9, rmtNodeOrder=19]
[11:16:47,301][INFO][grid-nio-worker-tcp-comm-0-#24][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:38902]
[11:16:47,359][INFO][grid-nio-worker-tcp-comm-0-#24][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601, locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9, rmtNodeOrder=19]
[11:16:47,675][INFO][grid-nio-worker-tcp-comm-1-#25][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:38904]
[11:16:47,733][INFO][grid-nio-worker-tcp-comm-1-#25][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601, locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9, rmtNodeOrder=19]
[11:16:48,049][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:38916]
[11:16:48,106][INFO][grid-nio-worker-tcp-comm-2-#26][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601, locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9, rmtNodeOrder=19]
[11:16:48,423][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.16.1.7:47100, rmtAddr=/10.139.0.10:38918]
[11:16:48,481][INFO][grid-nio-worker-tcp-comm-3-#27][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=c7e6fc55-d367-43d5-94e9-79ef1d984601, locNodeOrder=1, rmtNode=a66a573a-43dc-48d2-8ee5-232e727acbc9, rmtNodeOrder=19]

It's a bad sign. I think you either have network problems, or maxed out your communication.

I recommend the following configuration change to TcpCommunicationSpi:

socketWriteTimeout 5000
usePairedConnections true
connectionsPerNode 4.

You may also like to assign localAddr to known good (reachable) IP address of the node, on each node.

Regards,
--
Ilya Kasnacheev


пт, 7 февр. 2020 г. в 14:34, pg31 <[hidden email]>:
Thanks Ilya.

I have changed the Client Side Machine to prefer IPv4 Stack and hence that
error went away. But still the data-streamer-stripes and tcp-comm-worker
threads keep getting stuck.

I am attaching the logs again. (These contain the thread-dump themselves)
log.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2770/log.zip



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/