Ignite bulk data load issue

classic Classic list List threaded Threaded
9 messages Options
favas favas
Reply | Threaded
Open this post in threaded view
|

Ignite bulk data load issue

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 


ignite-5373576b.0.log.zip (478K) Download Attachment
ibelyakov ibelyakov
Reply | Threaded
Open this post in threaded view
|

Re: Ignite bulk data load issue

Hi,

According to the logs, I see a lot of "Possible too long JVM pause" messages, seems like you have a problem with GC activity. You should check GC logs and try to tune GC following this manual:

Also, it's suggested to disable WAL during initial data loading, to reduce amount of disk IO operations. Check "Performance tips" in the bottom of the next page for more details:

Regards,
Igor

On Mon, Sep 23, 2019 at 1:07 PM Muhammed Favas <[hidden email]> wrote:

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 

ezhuravlev ezhuravlev
Reply | Threaded
Open this post in threaded view
|

Re: Ignite bulk data load issue

In reply to this post by favas
Hi,

Looks like you can have really slow disks. What kind of disk do you have there? I see throttling in logs, Because the write operation is really slow. 

Evgenii

пн, 23 сент. 2019 г. в 13:07, Muhammed Favas <[hidden email]>:

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 

favas favas
Reply | Threaded
Open this post in threaded view
|

RE: Ignite bulk data load issue

In reply to this post by ibelyakov

Thanks Igor for the suggestions.

 

But, In the log JVM pause is not showing too much of time consumption(around 500 milliseconds) and I assume it is not the only problem of time taking. There are other reasons I expect for this strange behavior.

Even I tried with WAL disabled, still the time is going high after few files import.

 

Is there any way to figure out why so many THREAD WAIT are happening. Because of it there are SEVERE errors logging in the ignite nodes. Below are some sample errors from the log file I shared. Please help me here.

 

 

[05:02:45,413][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=data-streamer-stripe-0, blockedFor=34s]

[05:02:45,414][WARNING][tcp-disco-msg-worker-#2][G] Thread [name="data-streamer-stripe-0-#9", id=22, state=RUNNABLE, blockCnt=0, waitCnt=4375]

 

[05:02:45,415][SEVERE][tcp-disco-msg-worker-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=data-streamer-stripe-0, igniteInstanceName=null, finished=false, heartbeatTs=1569214931358]]]

class org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-0, igniteInstanceName=null, finished=false, heartbeatTs=1569214931358]

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)

                at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)

                at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)

                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)

                at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

 

 

Thread [name="sys-#367", id=468, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a4b4b27, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

 

Thread [name="sys-#366", id=467, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a4b4b27, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748).

 

 

Regards,

Favas 

 

From: Igor Belyakov <[hidden email]>
Sent: Monday, September 23, 2019 7:37 PM
To: [hidden email]
Subject: Re: Ignite bulk data load issue

 

Hi,

 

According to the logs, I see a lot of "Possible too long JVM pause" messages, seems like you have a problem with GC activity. You should check GC logs and try to tune GC following this manual:

 

Also, it's suggested to disable WAL during initial data loading, to reduce amount of disk IO operations. Check "Performance tips" in the bottom of the next page for more details:

 

Regards,

Igor

 

On Mon, Sep 23, 2019 at 1:07 PM Muhammed Favas <[hidden email]> wrote:

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 

favas favas
Reply | Threaded
Open this post in threaded view
|

RE: Ignite bulk data load issue

In reply to this post by ezhuravlev

Thanks Evgenii,

 

My cluster is AWS EC2 and below is the disk specification on each node

 

 

Regards,

Favas 

 

From: Evgenii Zhuravlev <[hidden email]>
Sent: Tuesday, September 24, 2019 12:44 PM
To: [hidden email]
Subject: Re: Ignite bulk data load issue

 

Hi,

 

Looks like you can have really slow disks. What kind of disk do you have there? I see throttling in logs, Because the write operation is really slow. 

 

Evgenii

 

пн, 23 сент. 2019 г. в 13:07, Muhammed Favas <[hidden email]>:

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 

ezhuravlev ezhuravlev
Reply | Threaded
Open this post in threaded view
|

Re: Ignite bulk data load issue

Well, yes, that's definitely could be the reason - it's probably not enough. To make initial data load faster, you can either disable WAL for some time or move WAL to the separate disk: https://apacheignite.readme.io/docs/durable-memory-tuning#section-separate-disk-device-for-wal

Also, you can just choose io1 disk type or give more IOPS to the disk - it will help too.

Best Regards,
Evgenii

вт, 24 сент. 2019 г. в 11:54, Muhammed Favas <[hidden email]>:

Thanks Evgenii,

 

My cluster is AWS EC2 and below is the disk specification on each node

 

 

Regards,

Favas 

 

From: Evgenii Zhuravlev <[hidden email]>
Sent: Tuesday, September 24, 2019 12:44 PM
To: [hidden email]
Subject: Re: Ignite bulk data load issue

 

Hi,

 

Looks like you can have really slow disks. What kind of disk do you have there? I see throttling in logs, Because the write operation is really slow. 

 

Evgenii

 

пн, 23 сент. 2019 г. в 13:07, Muhammed Favas <[hidden email]>:

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 

favas favas
Reply | Threaded
Open this post in threaded view
|

RE: Ignite bulk data load issue

In reply to this post by favas

Hi all,

 

Can some one help me here. I still facing the serious performance issue.

 

Regards,

Favas 

 

From: Muhammed Favas <[hidden email]>
Sent: Tuesday, September 24, 2019 2:15 PM
To: [hidden email]
Subject: RE: Ignite bulk data load issue

 

Thanks Igor for the suggestions.

 

But, In the log JVM pause is not showing too much of time consumption(around 500 milliseconds) and I assume it is not the only problem of time taking. There are other reasons I expect for this strange behavior.

Even I tried with WAL disabled, still the time is going high after few files import.

 

Is there any way to figure out why so many THREAD WAIT are happening. Because of it there are SEVERE errors logging in the ignite nodes. Below are some sample errors from the log file I shared. Please help me here.

 

 

[05:02:45,413][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=data-streamer-stripe-0, blockedFor=34s]

[05:02:45,414][WARNING][tcp-disco-msg-worker-#2][G] Thread [name="data-streamer-stripe-0-#9", id=22, state=RUNNABLE, blockCnt=0, waitCnt=4375]

 

[05:02:45,415][SEVERE][tcp-disco-msg-worker-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=data-streamer-stripe-0, igniteInstanceName=null, finished=false, heartbeatTs=1569214931358]]]

class org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-0, igniteInstanceName=null, finished=false, heartbeatTs=1569214931358]

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)

                at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)

                at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)

                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)

                at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)

                at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

 

 

Thread [name="sys-#367", id=468, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a4b4b27, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

 

Thread [name="sys-#366", id=467, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a4b4b27, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748).

 

 

Regards,

Favas 

 

From: Igor Belyakov <[hidden email]>
Sent: Monday, September 23, 2019 7:37 PM
To: [hidden email]
Subject: Re: Ignite bulk data load issue

 

Hi,

 

According to the logs, I see a lot of "Possible too long JVM pause" messages, seems like you have a problem with GC activity. You should check GC logs and try to tune GC following this manual:

 

Also, it's suggested to disable WAL during initial data loading, to reduce amount of disk IO operations. Check "Performance tips" in the bottom of the next page for more details:

 

Regards,

Igor

 

On Mon, Sep 23, 2019 at 1:07 PM Muhammed Favas <[hidden email]> wrote:

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 

favas favas
Reply | Threaded
Open this post in threaded view
|

RE: Ignite bulk data load issue

In reply to this post by favas

 

Hi,

 

I need help to figure out the issues identified during the bulk load of data into ignite cluster. 

My cluster consist of 5 nodes, each with 8 core CPU, 32 GB RAM and 30GB Disk. Also ignite native persistence is enabled for the table.

I am trying to load data into my ignite SQL table from csv file using COPY command. Each file consist of 50 Million record and I have numerous file of same size. During the initial time of loading, it was quite fast, but after some time the data load become very slow and now it is taking hours to load even a single file. Below is my observations

 

  • When I trigger the load first time after a pause, the CPU usage shows in promising level and that time data load is in higher rate.
  • After loading 2-3 file, the CPU starts dropping down to less than 1 % and it continue in that state for ever.
  • Then I stop the loading processes for some time and re-start, it again perform well and after some time the same situation happens.

 

When I checked the log file, I saw certain THREAD WAIT are happening, I believe due to this waits, the CPU is dropping down. I have attached the entire log file.

Can some one help me to figure out why it so in ignite? Or is it something I have made wrong in my configuration. Below is my configuration file content

 

<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="failureDetectionTimeout" value="30000"/>

        <!-- Redefining maximum memory size for the cluster node usage. --> 

        <property name="dataStorageConfiguration">

            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

 

                <property name="CheckpointReadLockTimeout" value="0" />

                <!-- Set the page size to 4 KB -->

                <property name="pageSize" value="#{4 * 1024}"/>

                <!-- Incraese WAL segment size to 1 GB - Default was 64 MB -->

                <property name="walSegmentSize" value="#{1L * 1024 * 1024 * 1024}"/>

                <!-- Set the wal segment to 5. Default was 10 -->

                <property name="walSegments" value="5"/>

                <!-- Set the wal segment history size to 5. Default was 20 -->

                <property name="WalHistorySize" value="5"/>

                <property name="walCompactionEnabled" value="true" />

                <property name="walCompactionLevel" value="6" />

                <!-- Enable write throttling. -->

                <property name="writeThrottlingEnabled" value="true"/>

            <!-- Redefining the default region's settings -->

            <property name="defaultDataRegionConfiguration">

                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">

                        <property name="persistenceEnabled" value="true"/>

                        <property name="name" value="Default_Region"/>

                        <!-- Setting the size of the default region to 24GB. -->

                        <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>

                        <!-- Increasing the check point buffer size to 2 GB. -->

                        <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>

                </bean>

            </property>

            </bean>

        </property>

 

        <property name="cacheConfiguration">

            <list>

                <!-- Partitioned cache example configuration -->

                <bean class="org.apache.ignite.configuration.CacheConfiguration">

                    <property name="name" value="default*"/>

                    <property name="cacheMode" value="PARTITIONED" />

                    <property name="atomicityMode" value="TRANSACTIONAL"/>

                    <property name="queryParallelism" value="8" />

                </bean>

            </list>

        </property>

 

 

Regards,

Favas 

 


ignite-5373576b.0.log.zip (478K) Download Attachment
ezhuravlev ezhuravlev
Reply | Threaded
Open this post in threaded view
|

RE: Ignite bulk data load issue

This message:
 Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=data-streamer-stripe-0,
blockedFor=34s]
is being printed in case if you're loading data to Ignite much faster than
it can write to the disk. The disk is a bottleneck here, as you mentioned
before, you chose a disk with a really small IOPS. Just give a better disk
to the system and you will see a better performance.

By the way, how much heap do you have for a JVM?

Evgenii



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/