very fast loading of very big table

Vladimir Tchernyi

very fast loading of very big table

In September 2020 I published a paper, "Loading Large Datasets into Apache Ignite by Using a Key-Value API" (English [1] and Russian [2] versions). The approach described works in production, but shows unacceptable performance for very large tables.

The story continues: yesterday I finished a proof of concept for very fast loading of a very big table. A partitioned MSSQL table of about 295 million rows was loaded into a 4-node Ignite cluster in 3 min 35 sec. Each node executed its own SQL queries in parallel and then distributed the loaded values across the other cluster nodes.

That result will probably be of interest to the community.

Regards,
Vladimir Chernyi

[1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
[2] https://m.habr.com/ru/post/526708/

dmagda
Re: very fast loading of very big table

Hello Vladimir, 

Good to hear from you! How much is that in gigabytes?

-
Denis


Vladimir Tchernyi
Re: very fast loading of very big table

Hi Denis,

Data space is 3.7 GB according to the MSSQL table properties.

Vladimir

ilya.kasnacheev

Re: very fast loading of very big table

Hello!

Have you tried enabling streaming (the data streamer) on the clients?

Regards,
--
Ilya Kasnacheev


ptupitsyn

Re: very fast loading of very big table

> 295 million rows
> 3 min 35 sec

Agree with Ilya; DataStreamer should do this much faster. Have you tried it?

> 3.7 GB

I would not call this "big" by any means today, when even the cheapest laptops have 8 GB of RAM.


Taras Ledkov

Re: very fast loading of very big table

Hi Vladimir,

Did you try the SQL command 'COPY FROM <csv_file>' via the thin JDBC driver?
This command uses 'IgniteDataStreamer' to write data into the cluster and parses the CSV on the server node.

PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
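
A minimal sketch of this approach, assuming a hypothetical 'person' table and CSV path (adjust the column list to your schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyFromCsv {
    public static void main(String[] args) throws Exception {
        // Connect through the thin JDBC driver.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            // COPY feeds IgniteDataStreamer under the hood, as noted above.
            stmt.executeUpdate(
                "COPY FROM '/data/person.csv' INTO person (id, name, salary) FORMAT CSV");
        }
    }
}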

-- 
Taras Ledkov
Mail-To: [hidden email]
maxi628

Re: very fast loading of very big table

What would be the difference between doing cache.putAll(all rows) and separating them by affinity key + executing putAll inside a compute job?
If I'm not mistaken, doing putAll should end up splitting those rows by affinity key on one of the servers, right?
Is there a comparison of that?

stephendarlington

Re: very fast loading of very big table

I think it’s more that putAll is mostly atomic, so the more records you save in one chunk, the more locking, etc., happens. Distributing the work as compute jobs means all the putAlls will be local, which is beneficial, and the size of each put is going to be smaller (also beneficial).

But that’s a lot of work that the data streamer already does for you, and the data streamer also batches updates, so it would still be faster.
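
To make that concrete, here is a minimal sketch of the pattern Maximiliano describes - group entries by the node that owns their key, then run a local putAll inside a compute job on each owner. The Integer/String types and the cache name are illustrative assumptions; as noted, the data streamer usually remains the simpler choice:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.lang.IgniteRunnable;

public class AffinityPutAll {
    /** Splits 'rows' by owning node and executes putAll locally on each owner. */
    public static void load(Ignite ignite, String cacheName, Map<Integer, String> rows) {
        Affinity<Integer> aff = ignite.affinity(cacheName);
        Map<ClusterNode, Collection<Integer>> byNode = aff.mapKeysToNodes(rows.keySet());

        byNode.forEach((node, keys) -> {
            Map<Integer, String> chunk = new HashMap<>();   // this node's share of the rows
            keys.forEach(k -> chunk.put(k, rows.get(k)));

            ignite.compute(ignite.cluster().forNode(node)).run((IgniteRunnable) () -> {
                // Runs on the owning node, so this putAll is a local, smaller put.
                IgniteCache<Integer, String> cache = Ignition.localIgnite().cache(cacheName);
                cache.putAll(chunk);
            });
        });
    }
}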



Vladimir Tchernyi

Re: very fast loading of very big table

Hi folks,
thanks for your interest in my work.

I didn't try COPY FROM: I tried to work with Ignite SQL a couple of years ago and didn't succeed, probably because the available examples aren't complete/downloadable/compilable (the paper [1] contains a GitHub repo - that is my five cents toward changing the status quo). My interest is in the KV API.

I did try a data streamer, and that was my first attempt. I did not notice a significant time reduction using the code from my paper [1] versus the data streamer/receiver, though there was some memory economy with the streamer. I must say my experiment was made on a heavily loaded production MSSQL server: a filtered query with a 300K-row result set takes about 15 sec. The story follows.

First of all, I tried to select the whole table at once; I got a network timeout and the client node was dropped off the cluster ("is node still alive?").
So I partitioned the table and executed a number of queries one by one on the client node, each query for a specific table partition. That process took about 90 min. Unacceptable.

Then I tried to execute my queries in parallel on the client node, each query feeding dataStreamer.addData() on a single data streamer. The timing was never less than 15 min across all attempts; probably that was the network throughput limit on the client node (the same interface is used for the result set and for cluster intercom). Once again: that was the production environment.

Final schema (a minimal sketch follows below):
* ComputeTask.map() schedules ComputeJobs among the cluster nodes, one job per table partition;
* each job executes its SQL query and constructs a map of binary object keys and values, then calls targetCache.invokeAll() with the constructed map and a static EntryProcessor class that contains the logic for the cache binary entry update;
* ComputeTask.reduce() sums the row counts reported by the jobs.

The schema described proved to be network error-free in my production network and gives acceptable timing.
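
A condensed sketch of that schema, with an assumed MSSQL connection string, table layout, and cache name ('personCache') - not the actual production code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteException;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** One ComputeJob per MSSQL table partition; reduce() sums per-job row counts. */
public class LoadTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(
            List<ClusterNode> nodes, List<Integer> partitions) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (Integer part : partitions)  // round-robin table partitions over data nodes
            jobs.put(new LoadJob(part), nodes.get(i++ % nodes.size()));
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum();
    }

    private static class LoadJob extends ComputeJobAdapter {
        @IgniteInstanceResource private transient Ignite ignite;
        private final int part;

        LoadJob(int part) { this.part = part; }

        @Override public Object execute() {
            // Each job pulls its own partition from MSSQL and builds binary pairs.
            Map<BinaryObject, BinaryObject> batch = new HashMap<>();
            try (Connection c = DriverManager.getConnection("jdbc:sqlserver://..."); // assumed
                 PreparedStatement st = c.prepareStatement(
                         "SELECT id, name FROM dbo.person WHERE partId = ?")) {
                st.setInt(1, part);
                try (ResultSet rs = st.executeQuery()) {
                    while (rs.next()) {
                        batch.put(
                            ignite.binary().builder("PersonKey")
                                .setField("id", rs.getLong("id")).build(),
                            ignite.binary().builder("Person")
                                .setField("name", rs.getString("name")).build());
                    }
                }
            }
            catch (SQLException e) {
                throw new IgniteException(e);
            }

            // Static EntryProcessor updates the entries; the batch map travels as an argument.
            ignite.<BinaryObject, BinaryObject>cache("personCache").withKeepBinary()
                .invokeAll(batch.keySet(), new Upsert(), batch);

            return (long) batch.size();
        }
    }

    private static class Upsert implements CacheEntryProcessor<BinaryObject, BinaryObject, Object> {
        @SuppressWarnings("unchecked")
        @Override public Object process(MutableEntry<BinaryObject, BinaryObject> e, Object... args) {
            e.setValue(((Map<BinaryObject, BinaryObject>) args[0]).get(e.getKey()));
            return null;
        }
    }
}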

Vladimir

[1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api



ptupitsyn

Re: very fast loading of very big table

> First of all, I tried to select the whole table at once

Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.

Can you run a test where dummy data of the same size as the real data is generated and inserted into Ignite, so that we test Ignite performance only, excluding MSSQL from the equation?
For example, streaming 300 million entries (total size 6 GB) takes around 1 minute on my machine with a simple single-threaded DataStreamer.
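
For context, a sketch of that kind of throughput test; the cache name and the ~20-byte payload are illustrative assumptions, and real keys/values will differ:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerBench {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("bench");

            long start = System.nanoTime();
            try (IgniteDataStreamer<Long, byte[]> streamer = ignite.dataStreamer("bench")) {
                byte[] payload = new byte[20];            // ~20 bytes per value, ~6 GB total
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, payload);         // batched and routed automatically
            }                                             // close() flushes remaining batches
            System.out.printf("Loaded in %d s%n", (System.nanoTime() - start) / 1_000_000_000L);
        }
    }
}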



Vladimir Tchernyi

Re: very fast loading of very big table

Pavel,

maybe it's time to put your five cents in. Can you share your code as a GitHub project, maybe with a script to reproduce the 6 GB of data?

As for MSSQL data retrieval being the bottleneck - I don't think so: I got a 15 min load time with 1 node and 3.5 min with 4 nodes. That looks like a linear dependency (the table and the RDBMS server were the same).
--
Vladimir



Vladimir Tchernyi

Re: very fast loading of very big table

Hi, 

I've spent some time thinking about the community comments on my post. It seems that Ignite is really not the bottleneck here. The performance of my production MSSQL is a given restriction, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months to look for possible problems. If everything is OK, a complete/downloadable/compilable GitHub example will probably be useful for the community.

WDYT?



ptupitsyn

Re: very fast loading of very big table

Vladimir,

Thanks for getting back to us. A full example that clarifies the situation would be great!

> Can you share your code as a GitHub project, maybe with a script to reproduce the 6 GB of data?

It is super trivial; I just wanted to get a sense of the throughput and check whether we have some kind of regression in recent versions (we don't) [1].
Also, I realized that the data size can be counted very differently - do we account for DB overhead, and how?




Vladimir Tchernyi

Re: very fast loading of very big table

Hi Pavel,

the code [1] you shared is a kind of in-memory experiment with all the processes inside a single JVM. My work differs: it comes from a big retail business and hence is 100% practice-oriented. Truth be told, it's oriented to the state of things inside my company, and that is my question - will my results be interesting to the community? I have seen a lot of questions on the user list regarding data loading, and the difficulties here seem to be a blocker to extending Ignite's popularity.

Please let me know if my case is not common in the industry. We have a big bare-metal Windows MSSQL server and a number of bare-metal hosts, each with virtualization software and a single CentOS virtual server inside. These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1 client node. The example [2] I published last year is intended to solve the business problem we have here:
1) the data currently present in the cluster has zero value;
2) the actual data is in the database and must be loaded into the cluster ASAP; we use BinaryObject as cache key and value;
3) the cluster performs some data processing and writes the result to the database.

Unfortunately, the code [2] does not work 100% OK in my case: it tends to say "is node still alive" and to drop the client node off the cluster. The performance of the MSSQL server and the network is what it is; I consider it a given restriction. It seems I got some progress when I managed to move the data-loading process from a single client node to multiple data nodes. When extra data nodes are added, I expect the load performance will be better - of course, only as long as my big MSSQL server is able to hold the load. So I want to know how interesting my results would be if published.

WDYT?



ptupitsyn

Re: very fast loading of very big table

Vladimir,

I think all real-world use cases are very valuable to the community.
However, we should be careful to avoid misleading conclusions.

We have well-known patterns for loading data from other systems: DataStreamer [1] and CacheStore [2].
The article [3] seems a bit confusing to me, since neither of those two patterns is mentioned there.
When proposing a custom approach, it would be great to compare it to the standard alternatives.
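
For reference, a minimal sketch of the CacheStore pattern mentioned above, assuming a hypothetical 'dbo.person' table; the read/write-through methods are stubbed out:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.lang.IgniteBiInClosure;

/** Invoked on every data node by IgniteCache#loadCache(null). */
public class PersonStore extends CacheStoreAdapter<Long, String> {
    @Override public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
        try (Connection c = DriverManager.getConnection("jdbc:sqlserver://..."); // assumed URL
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM dbo.person")) {
            while (rs.next())
                clo.apply(rs.getLong(1), rs.getString(2)); // stream each row into the cache
        }
        catch (SQLException e) {
            throw new CacheLoaderException(e);
        }
    }

    @Override public String load(Long key) { return null; }  // read-through omitted
    @Override public void write(Cache.Entry<? extends Long, ? extends String> e) { /* no-op */ }
    @Override public void delete(Object key) { /* no-op */ }
}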




Vladimir Tchernyi

Re: very fast loading of very big table

Pavel,

thanks for mentioning the patterns. Of course, I spent a lot of time reading the documentation - [2] at the very beginning and [1] a couple of months ago. Here is the origin of my pain in the neck about a complete GitHub example: neither [1] nor [2] answers my problem. The keyword in my case is ASAP, so there should be a multithreaded example. And of course, a real-world example must not use primitive types as cache values; I tried to illustrate that in [3].

I built one with the data streamer [1]; it seems I was limited by network adapter performance (see my previous post in this thread). That is the reason I decided to move the SQL queries to the data nodes.

Vladimir

Fri, 26 Feb 2021 at 10:54, Pavel Tupitsyn <[hidden email]>:
Vladimir,

I think all real-world use cases are very valuable to the community.
However, we should be careful to avoid misleading conclusions.

We have well-known patterns for loading data from other systems: DataStreamer [1] and CacheStore [2].
The article [3] seems a bit confusing to me, since neither of those two patterns is mentioned there.
When proposing a custom approach, it would be great to compare it to the standard alternatives.
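For reference, the CacheStore pattern [2] boils down to a class like this sketch (the JDBC URL and the SQL are placeholders, not anyone's real setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.lang.IgniteBiInClosure;

/** Bulk-load store: IgniteCache#loadCache(null) runs it on every server node. */
public class PersonStore extends CacheStoreAdapter<Long, String> {
    @Override public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://db-host;databaseName=db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM person")) {
            while (rs.next())
                clo.apply(rs.getLong(1), rs.getString(2)); // feed each row into the cache
        }
        catch (Exception e) {
            throw new CacheLoaderException(e);
        }
    }

    @Override public String load(Long key) { return null; } // read-through not used here

    @Override public void write(Cache.Entry<? extends Long, ? extends String> e) { /* no-op */ }

    @Override public void delete(Object key) { /* no-op */ }
}

You plug it in via CacheConfiguration#setCacheStoreFactory and trigger it with cache.loadCache(null); every server node then runs loadCache in parallel (in a real setup each node should query only its own slice of the table, e.g. by passing partition bounds through the args).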


On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <[hidden email]> wrote:
Hi Pavel,

the code [1] you shared is a kind of in-memory experiment with all the processes inside a single JVM. My work differs: it comes from a large retail business, so it is 100% practice-oriented. Truth be told, it is oriented to the state of things inside my company, hence my question: will my results be of interest to the community? I have seen a lot of questions on the user list regarding data loading, and the difficulties here seem to be a blocker to extending Ignite's popularity.

Please let me know if my case is not common in the industry. We have a big bare-metal Windows MSSQL server and a number of bare-metal hosts, each running virtualization software with a single CentOS virtual server inside. These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1 client node. The example [2] I published last year is intended to solve the business problem we have here:
1) the data currently present in the cluster has zero value;
2) the actual data is in the database and must be loaded into the cluster ASAP; we use BinaryObject as both cache key and value;
3) the cluster performs some data processing and writes the result back to the database.

Unfortunately, the code [2] does not work 100% OK in my case: it tends to say "is node still alive?" and to drop the client node off the cluster. The performance of MSSQL and the network is what it is; I consider it a given restriction. It seems I made some progress when I managed to move the data-loading process from the single client node to multiple data nodes. When extra data nodes are added, I expect the load performance to improve, at least for as long as my big MSSQL server can keep up with the load. So I want to know how interesting my results would be if published.

WDYT?

Thu, 25 Feb 2021 at 11:01, Pavel Tupitsyn <[hidden email]>:
Vladimir,

Thanks for getting back to us. A full example that clarifies the situation would be great!

> Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.

It is super trivial; I just wanted to get a sense of the throughput and check whether we have some kind of regression in recent versions (we don't) [1].
Also, I realized that the data size can be counted very differently: do we account for DB overhead, and how?


On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <[hidden email]> wrote:
Hi, 

I've spent some time thinking about the community's comments on my post. It seems that Ignite is really not the bottleneck here. The performance of my production MSSQL server is a given restriction, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months to catch possible problems. If it works out, a complete/downloadable/compilable GitHub example will probably be useful to the community.

WDYT?

Fri, 19 Feb 2021 at 21:47, Vladimir Tchernyi <[hidden email]>:
Pavel,

maybe it's time to put your five cents in. Can you share your code as a GitHub project? Maybe with a script to reproduce the 6 GB of data.

As for MSSQL data retrieval being the bottleneck: I don't think so. I got a 15-minute load time with 1 node and 3.5 minutes with 4 nodes, which looks like a linear dependency (the table and the RDBMS server were the same).
--
Vladimir

Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn <[hidden email]>:
> First of all, I tried to select the whole table as once

Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.

Can you run a test where dummy data of the same size as the real data is generated and inserted into Ignite,
so that we test Ignite performance only, excluding MSSQL from the equation?
For example, streaming 300 million entries (6 GB total) takes around 1 minute on my machine with a simple single-threaded DataStreamer.
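Roughly this kind of loop, with generated values instead of a database (a sketch with illustrative sizes, not the exact code I ran):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerBench {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("bench");

            long started = System.nanoTime();

            // The streamer batches entries per node; no external DB is involved,
            // so this measures pure Ignite ingest speed.
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("bench")) {
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, "value-" + i); // ~20 bytes of payload per entry
            } // close() flushes the remaining batches

            System.out.printf("Loaded in %d s%n", (System.nanoTime() - started) / 1_000_000_000L);
        }
    }
}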

On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[hidden email]> wrote:
Hi folks,
thanks for your interest in my work.

I didn't try COPY FROM, since I tried to work with Ignite SQL a couple of years ago and didn't succeed, probably because the available examples aren't complete/downloadable/compilable (the paper [1] contains a GitHub repo; that is my five cents toward changing the status quo). My interest is in the KV API.

I did try a data streamer; that was my first attempt. I did not notice a significant time reduction with the data streamer/receiver versus the code from my paper [1]. There were some memory savings with the streamer, though. I must say my experiment ran against a heavily loaded production MSSQL server, where a filtered query with a 300K-row result set takes about 15 seconds. The story follows.

First of all, I tried to select the whole table at once; I got a network timeout and the client node was dropped off the cluster ("is node still alive?").
So I partitioned the table and executed a number of queries one by one on the client node, each query for a specific table partition. That process took about 90 minutes. Unacceptable.

Then I tried to execute my queries in parallel on the client node, each query feeding dataStreamer.addData() on a single data streamer. The timing was never less than 15 minutes over all attempts; probably that was the network throughput limit on the client node (the same interface was used for the result sets and for cluster intercom). Again, that was the production environment.

Final scheme:
* ComputeTask.map() schedules ComputeJobs across the cluster nodes, one job per table partition;
* each job executes its SQL query and constructs a map of binary-object keys and values. Then the job calls targetCache.invokeAll(), passing the constructed map and the static EntryProcessor class; the EntryProcessor contains the logic for updating the binary cache entries;
* ComputeTask.reduce() sums up the row counts reported by the jobs.

The scheme described has proved free of network errors on my production network and gives acceptable timing.

Vladimir

[1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
