Binary recovery for a very long time

classic Classic list List threaded Threaded
11 messages Options
38797715 38797715
Reply | Threaded
Open this post in threaded view
|

Binary recovery for a very long time

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

38797715 38797715
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

ezhuravlev ezhuravlev
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

38797715 38797715
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

ezhuravlev ezhuravlev
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

38797715 38797715
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hi,

If direct IO is disabled, the startup speed will be doubled, including some other tests. I find that direct IO has a great impact on the read performance.

在 2020/5/14 上午5:16, Evgenii Zhuravlev 写道:
Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hello!

Direct IO module is experimental and should not be used unless performance is tested first, in your specific use case.

Regards,
--
Ilya Kasnacheev


пн, 18 мая 2020 г. в 16:47, 38797715 <[hidden email]>:

Hi,

If direct IO is disabled, the startup speed will be doubled, including some other tests. I find that direct IO has a great impact on the read performance.

在 2020/5/14 上午5:16, Evgenii Zhuravlev 写道:
Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

38797715 38797715
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hi,

the following log message:

[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]

I have the following questions:

1.What has been done in the startup cache in recovery mode?

2.After testing, if the node stops normally (non abnormal shutdown), the recovery process will also be performed during startup. Why?

在 2020/5/18 下午9:58, Ilya Kasnacheev 写道:
Hello!

Direct IO module is experimental and should not be used unless performance is tested first, in your specific use case.

Regards,
--
Ilya Kasnacheev


пн, 18 мая 2020 г. в 16:47, 38797715 <[hidden email]>:

Hi,

If direct IO is disabled, the startup speed will be doubled, including some other tests. I find that direct IO has a great impact on the read performance.

在 2020/5/14 上午5:16, Evgenii Zhuravlev 写道:
Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hello!

1. I guess that WAL is read.
2. Unfortunately we do not have truly graceful exit as far as my understanding goes.

Regards,
--
Ilya Kasnacheev


вт, 19 мая 2020 г. в 10:22, 38797715 <[hidden email]>:

Hi,

the following log message:

[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]

I have the following questions:

1.What has been done in the startup cache in recovery mode?

2.After testing, if the node stops normally (non abnormal shutdown), the recovery process will also be performed during startup. Why?

在 2020/5/18 下午9:58, Ilya Kasnacheev 写道:
Hello!

Direct IO module is experimental and should not be used unless performance is tested first, in your specific use case.

Regards,
--
Ilya Kasnacheev


пн, 18 мая 2020 г. в 16:47, 38797715 <[hidden email]>:

Hi,

If direct IO is disabled, the startup speed will be doubled, including some other tests. I find that direct IO has a great impact on the read performance.

在 2020/5/14 上午5:16, Evgenii Zhuravlev 写道:
Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

38797715 38797715
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

Hi,

I enabled debug logging and found the following log output:

[2020-05-23T21:43:58,397][DEBUG][nio-acceptor-tcp-comm-#28%ClusterName1%][TcpCommunicationSpi] Balancing data [min0=0, minIdx=0, max0=-1, maxIdx=-1]
[2020-05-23T21:43:58,405][DEBUG][main][SchemaManager] Creating DB table with SQL: CREATE TABLE "PUBLIC"."CO_CUST" (_KEY VARCHAR INVISIBLE NOT NULL,_VAL OTHER INVISIBLE,"CUST_ID" VARCHAR(30) NOT NULL,"CUST_NAME" VARCHAR(200),"CUST_SHORT_NAME" VARCHAR(200),"CUST_SHORT_ID" VARCHAR(240),"LICENSE_CODE" VARCHAR(30),"STATUS" VARCHAR(16),"COM_ID" VARCHAR(30),"SALE_CENTER_ID" VARCHAR(30),"SALE_DEPT_ID" VARCHAR(30),"SLSMGR_ID" VARCHAR(30),"SLSMAN_ID" VARCHAR(30),"SLSMAN_MOBILE" VARCHAR(16),"MANAGER" VARCHAR(100),"IDENTITY_CARD_ID" VARCHAR(36),"ORDER_TEL" VARCHAR(80),"INV_TYPE" VARCHAR(18),"ORDER_WAY" VARCHAR(18),"PAY_TYPE" VARCHAR(2),"PERIODS" VARCHAR(200),"PRD_ST_DATE" VARCHAR(8),"ORDER_CUST_ID" VARCHAR(30),"BUSI_ADDR" VARCHAR(400),"WORK_PORT" VARCHAR(18),"BASE_TYPE" VARCHAR(24),"SALE_SCOPE" VARCHAR(1),"SCOPE" VARCHAR(200),"COM_CHARA" VARCHAR(2),"INNER_TYPE" VARCHAR(18),"CUST_KIND" VARCHAR(10),"CUST_KIND_NAME" VARCHAR(120),"CUST_TYPE" VARCHAR(6),"CUST_TYPE1" VARCHAR(6),"CUST_TYPE2" VARCHAR(6),"CUST_TYPE3" VARCHAR(6),"CUST_TYPE4" VARCHAR(6),"CUST_TYPE5" VARCHAR(6),"AREA_TYPE" VARCHAR(2),"IS_SEFL_CUST" VARCHAR(1),"IS_FUNC_CUST" VARCHAR(1),"MANAGER_BIRTHDAY" VARCHAR(8),"CELEBRATE_DATE" VARCHAR(8),"NATION_CUST_CODE" VARCHAR(30),"LONGITUDE" DECIMAL,"LATITUDE" DECIMAL,"AGENT_MESSAGE" VARCHAR(400),"NOTE" VARCHAR(1000),"UPDATE_TIME" TIMESTAMP,"INVTY_ID" VARCHAR(30),"STEP_ID" VARCHAR(20),"INV_CUST_ID" VARCHAR(60),"INV_UNIT_NAME" VARCHAR(200),"ACCOUNT" VARCHAR(200),"BANK" VARCHAR(100),"TAX_ACCOUNT" VARCHAR(120),"OTHER_ORDER_WAY" VARCHAR(32),"SALE_AVG" DECIMAL,"ITEM_ORD" DECIMAL,"QTY_SOLD" DECIMAL,"AMT_SOLD" DECIMAL,"MANAGER_TEL" VARCHAR(80),"IS_SALE_LARGE" VARCHAR(1),"TAX_TEL" VARCHAR(60),"TAX_ADDR" VARCHAR(400),"IS_TOR_TAX" VARCHAR(1),"CUST_TYPE6" VARCHAR(6),"CUST_TYPE7" VARCHAR(6),"CUST_TYPE8" VARCHAR(124),"CUST_TYPE9" VARCHAR(60),"CUST_TYPE10" VARCHAR(200),"CANT_ID" VARCHAR(20),"IS_ONLINE_PAY" VARCHAR(1),"IS_RAIL_CUST" VARCHAR(1),"CUST_SEG" VARCHAR(30),"QTY_MULTIPLE" VARCHAR(20),"TAX_ADRR" VARCHAR(200),"COLLECT_STAFF_ID" VARCHAR(30),"ITEM_HEIGHT" DECIMAL,"AREA_ID" VARCHAR(30),"BASE_TYPE_EXT" VARCHAR(30),"AREA_TYPE_EXT" VARCHAR(30),"WORK_PORT_EXT" VARCHAR(30),"CUST_SEG_EXT" VARCHAR(30),"IS_CIGAR_CUST" VARCHAR(1))
[2020-05-23T21:43:58,411][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=_key_PK]
[2020-05-23T21:43:59,081][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=IDX_CO_CUST_SALE_CENTER_ID]
[2020-05-23T21:43:59,088][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=IDX_CO_CUST_STATUS]
[2020-05-23T21:43:59,099][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=IDX_CO_CUST_SALE_DEPT_ID]
[2020-05-23T21:43:59,109][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]

It can be seen from the log that all tables and indexes need to be rebuilt during node startup? Just don't load data?

在 2020/5/21 下午8:48, Ilya Kasnacheev 写道:
Hello!

1. I guess that WAL is read.
2. Unfortunately we do not have truly graceful exit as far as my understanding goes.

Regards,
--
Ilya Kasnacheev


вт, 19 мая 2020 г. в 10:22, 38797715 <[hidden email]>:

Hi,

the following log message:

[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]

I have the following questions:

1.What has been done in the startup cache in recovery mode?

2.After testing, if the node stops normally (non abnormal shutdown), the recovery process will also be performed during startup. Why?

在 2020/5/18 下午9:58, Ilya Kasnacheev 写道:
Hello!

Direct IO module is experimental and should not be used unless performance is tested first, in your specific use case.

Regards,
--
Ilya Kasnacheev


пн, 18 мая 2020 г. в 16:47, 38797715 <[hidden email]>:

Hi,

If direct IO is disabled, the startup speed will be doubled, including some other tests. I find that direct IO has a great impact on the read performance.

在 2020/5/14 上午5:16, Evgenii Zhuravlev 写道:
Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?

ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Binary recovery for a very long time

In reply to this post by ilya.kasnacheev
Hello!

We use H2 engine and we need to create tables within H2. We're not touching much data in the process. I'm not sure this is a factor of slowdown since it fits in 2s here.

Regards,
--
Ilya Kasnacheev


ср, 27 мая 2020 г. в 04:20, 38797715 <[hidden email]>:

Hi,

I enabled debug logging and found the following log output:

[2020-05-23T21:43:58,397][DEBUG][nio-acceptor-tcp-comm-#28%ClusterName1%][TcpCommunicationSpi] Balancing data [min0=0, minIdx=0, max0=-1, maxIdx=-1]
[2020-05-23T21:43:58,405][DEBUG][main][SchemaManager] Creating DB table with SQL: CREATE TABLE "PUBLIC"."CO_CUST" (_KEY VARCHAR INVISIBLE NOT NULL,_VAL OTHER INVISIBLE,"CUST_ID" VARCHAR(30) NOT NULL,"CUST_NAME" VARCHAR(200),"CUST_SHORT_NAME" VARCHAR(200),"CUST_SHORT_ID" VARCHAR(240),"LICENSE_CODE" VARCHAR(30),"STATUS" VARCHAR(16),"COM_ID" VARCHAR(30),"SALE_CENTER_ID" VARCHAR(30),"SALE_DEPT_ID" VARCHAR(30),"SLSMGR_ID" VARCHAR(30),"SLSMAN_ID" VARCHAR(30),"SLSMAN_MOBILE" VARCHAR(16),"MANAGER" VARCHAR(100),"IDENTITY_CARD_ID" VARCHAR(36),"ORDER_TEL" VARCHAR(80),"INV_TYPE" VARCHAR(18),"ORDER_WAY" VARCHAR(18),"PAY_TYPE" VARCHAR(2),"PERIODS" VARCHAR(200),"PRD_ST_DATE" VARCHAR(8),"ORDER_CUST_ID" VARCHAR(30),"BUSI_ADDR" VARCHAR(400),"WORK_PORT" VARCHAR(18),"BASE_TYPE" VARCHAR(24),"SALE_SCOPE" VARCHAR(1),"SCOPE" VARCHAR(200),"COM_CHARA" VARCHAR(2),"INNER_TYPE" VARCHAR(18),"CUST_KIND" VARCHAR(10),"CUST_KIND_NAME" VARCHAR(120),"CUST_TYPE" VARCHAR(6),"CUST_TYPE1" VARCHAR(6),"CUST_TYPE2" VARCHAR(6),"CUST_TYPE3" VARCHAR(6),"CUST_TYPE4" VARCHAR(6),"CUST_TYPE5" VARCHAR(6),"AREA_TYPE" VARCHAR(2),"IS_SEFL_CUST" VARCHAR(1),"IS_FUNC_CUST" VARCHAR(1),"MANAGER_BIRTHDAY" VARCHAR(8),"CELEBRATE_DATE" VARCHAR(8),"NATION_CUST_CODE" VARCHAR(30),"LONGITUDE" DECIMAL,"LATITUDE" DECIMAL,"AGENT_MESSAGE" VARCHAR(400),"NOTE" VARCHAR(1000),"UPDATE_TIME" TIMESTAMP,"INVTY_ID" VARCHAR(30),"STEP_ID" VARCHAR(20),"INV_CUST_ID" VARCHAR(60),"INV_UNIT_NAME" VARCHAR(200),"ACCOUNT" VARCHAR(200),"BANK" VARCHAR(100),"TAX_ACCOUNT" VARCHAR(120),"OTHER_ORDER_WAY" VARCHAR(32),"SALE_AVG" DECIMAL,"ITEM_ORD" DECIMAL,"QTY_SOLD" DECIMAL,"AMT_SOLD" DECIMAL,"MANAGER_TEL" VARCHAR(80),"IS_SALE_LARGE" VARCHAR(1),"TAX_TEL" VARCHAR(60),"TAX_ADDR" VARCHAR(400),"IS_TOR_TAX" VARCHAR(1),"CUST_TYPE6" VARCHAR(6),"CUST_TYPE7" VARCHAR(6),"CUST_TYPE8" VARCHAR(124),"CUST_TYPE9" VARCHAR(60),"CUST_TYPE10" VARCHAR(200),"CANT_ID" VARCHAR(20),"IS_ONLINE_PAY" VARCHAR(1),"IS_RAIL_CUST" VARCHAR(1),"CUST_SEG" VARCHAR(30),"QTY_MULTIPLE" VARCHAR(20),"TAX_ADRR" VARCHAR(200),"COLLECT_STAFF_ID" VARCHAR(30),"ITEM_HEIGHT" DECIMAL,"AREA_ID" VARCHAR(30),"BASE_TYPE_EXT" VARCHAR(30),"AREA_TYPE_EXT" VARCHAR(30),"WORK_PORT_EXT" VARCHAR(30),"CUST_SEG_EXT" VARCHAR(30),"IS_CIGAR_CUST" VARCHAR(1))
[2020-05-23T21:43:58,411][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=_key_PK]
[2020-05-23T21:43:59,081][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=IDX_CO_CUST_SALE_CENTER_ID]
[2020-05-23T21:43:59,088][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=IDX_CO_CUST_STATUS]
[2020-05-23T21:43:59,099][DEBUG][main][IgniteH2Indexing] Creating cache index [cacheId=1684722246, idxName=IDX_CO_CUST_SALE_DEPT_ID]
[2020-05-23T21:43:59,109][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]

It can be seen from the log that all tables and indexes need to be rebuilt during node startup? Just don't load data?

在 2020/5/21 下午8:48, Ilya Kasnacheev 写道:
Hello!

1. I guess that WAL is read.
2. Unfortunately we do not have truly graceful exit as far as my understanding goes.

Regards,
--
Ilya Kasnacheev


вт, 19 мая 2020 г. в 10:22, 38797715 <[hidden email]>:

Hi,

the following log message:

[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]

I have the following questions:

1.What has been done in the startup cache in recovery mode?

2.After testing, if the node stops normally (non abnormal shutdown), the recovery process will also be performed during startup. Why?

在 2020/5/18 下午9:58, Ilya Kasnacheev 写道:
Hello!

Direct IO module is experimental and should not be used unless performance is tested first, in your specific use case.

Regards,
--
Ilya Kasnacheev


пн, 18 мая 2020 г. в 16:47, 38797715 <[hidden email]>:

Hi,

If direct IO is disabled, the startup speed will be doubled, including some other tests. I find that direct IO has a great impact on the read performance.

在 2020/5/14 上午5:16, Evgenii Zhuravlev 写道:
Can you share full logs from all nodes?

вт, 12 мая 2020 г. в 18:24, 38797715 <[hidden email]>:

Hi Evgenii,

The storage used is not SSD.

We will use different versions of ignite for further testing, such as ignite2.8.
Ignite is configured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="peerClassLoadingEnabled" value="true"/>
<property name="consistentId" value="20"/>
<property name="failureDetectionTimeout" value="120000"/>
<property name="workDirectory" value="/appdata/ignite"/>
<property name="rebalanceBatchSize" value="#{2 * 1024 * 1024}"/>
<property name="rebalanceThrottle" value="100"/>
<property name="rebalanceThreadPoolSize" value="4"/>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
</bean>
</property>
<property name="cacheConfiguration">
<list>
<bean id="partitioned-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-partitioned*"/>
<property name="cacheMode" value="PARTITIONED" />
<property name="backups" value="1" />
<property name="queryParallelism" value="16"/>
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
<bean id="replicated-cache-template" abstract="true" class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="cache-replicated*"/>
<property name="cacheMode" value="REPLICATED" />
<property name="partitionLossPolicy" value="READ_ONLY_SAFE"/>
</bean>
</list>
</property>
<!-- Enabling Apache Ignite Persistent Store. -->
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
<property name="maxSize" value="#{200L * 1024 * 1024 * 1024}"/>
</bean>
</property>
</bean>
</property>
</bean>
</beans>
在 2020/5/13 上午4:45, Evgenii Zhuravlev 写道:
Hi,

Can you share full logs and configuration? What disk so you use?

Evgenii

вт, 12 мая 2020 г. в 06:49, 38797715 <[hidden email]>:

Among them:
CO_CO_NEW: ~ 48 minutes(partitioned,backup=1,33M)

Ignite sys cache: ~ 27 minutes

PLM_ITEM:~3 minutes(repicated,1.9K)


在 2020/5/12 下午9:08, 38797715 写道:

Hi community,

We have 5 servers, 16 cores, 256g memory, and 200g off-heap memory.
We have 7 tables to test, and the data volume is respectively:31.8M,495.2M,552.3M,33M,873.3K,28M,1.9K(replicated),others are partitioned(backup = 1)

VM args:-server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/data/gc/logs/gclog.txt -Djava.net.preferIPv4Stack=true -XX:MaxDirectMemorySize=256M -XX:+PrintAdaptiveSizePolicy

Today, one of the servers was restarted(kill and then start ignite.sh) for some reason, but the node took 1.5 hours to start, which was much longer than expected.

After analyzing the log, the following information is found:

[2020-05-12T17:00:05,138][INFO ][main][GridCacheDatabaseSharedManager] Found last checkpoint marker [cpId=7a0564f2-43e5-400b-9439-746fc68a6ccb, pos=FileWALPointer [idx=10511, fileOff=51348888, len=61193]]
[2020-05-12T17:00:05,151][INFO ][main][GridCacheDatabaseSharedManager] Binary memory state restored at node startup [restoredPtr=FileWALPointer [idx=10511, fileOff=51410110, len=0]]
[2020-05-12T17:00:05,152][INFO ][main][FileWriteAheadLogManager] Resuming logging to WAL segment [file=/appdata/ignite/db/wal/24/0000000000000001.wal, offset=51410110, ver=2]
[2020-05-12T17:00:06,448][INFO ][main][PageMemoryImpl] Started page memory [memoryAllocated=200.0 GiB, pages=50821088, tableSize=3.9 GiB, checkpointBuffer=2.0 GiB]
[2020-05-12T17:02:08,528][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_NEW, id=-189779360, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,341][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE, id=-1588248812, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T17:50:44,366][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[2020-05-12T18:17:57,071][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO_LINE_NEW, id=1742991829, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,910][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PI_COM_DAY, id=-1904194728, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:19:54,949][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=PLM_ITEM, id=-1283854143, dataRegionName=default, mode=REPLICATED, atomicity=ATOMIC, backups=2147483647, mvcc=false]
[2020-05-12T18:22:53,662][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CO, id=64322847, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,876][INFO ][main][GridCacheProcessor] Started cache in recovery mode [name=CO_CUST, id=1684722246, dataRegionName=default, mode=PARTITIONED, atomicity=ATOMIC, backups=1, mvcc=false]
[2020-05-12T18:22:54,892][INFO ][main][GridCacheDatabaseSharedManager] Binary recovery performed in 4970233 ms.

Among them, binary recovery took 4970 seconds.

Our question is:

1.Why is the start time so long?

2.Is the current state of ignite, with the growth of single node data volume, the restart time will be longer and longer?

3.Do have any suggestions for optimizing the restart time?