Streamer and data loss

narges saleh

Streamer and data loss

Hi All,

Another question regarding Ignite's streamer.
What happens to the data if the streamer node crashes before the buffer's contents are flushed to the cache? Is the client responsible for making sure the data is persisted, or does Ignite redirect the data to another node's streamer?

thanks.
aealexsandrov

Re: Streamer and data loss

Hi,

Data that has not been flushed from a data streamer will be lost. A data streamer works through a specific Ignite node, and if that node fails, the streamer cannot somehow resume its work on another one. So your application should track that all the data was loaded (wait for the load to complete, catch exceptions, check the cache sizes, etc.) and use another client to reload the data if the previous one failed.

BR,
Andrei
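
[Editor's note: for illustration only, not from the thread — a minimal sketch of a streaming load that flushes explicitly and catches failures so the application can decide what to re-send. The cache name and key/value types are placeholders.]

import java.util.stream.IntStream;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerLoad {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        ignite.getOrCreateCache("myCache");

        try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("myCache")) {
            IntStream.range(0, 1_000_000).forEach(i -> streamer.addData(i, "value-" + i));

            // flush() blocks until buffered entries reach the cache; entries
            // still buffered when the streamer node dies are simply lost.
            streamer.flush();
        } catch (Exception e) {
            // No per-entry delivery guarantee on failure: the application has
            // to track what was loaded (e.g. compare the cache size to the
            // number of records sent) and re-send from another client.
            System.err.println("Load failed, data must be re-sent: " + e);
        }
    }
}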

narges saleh

Re: Streamer and data loss

Thanks Andrei.

Suppose an external data source client sends batches of 2-3 MB, say over a TCP socket connection, to a set of socket streamers (deployed as Ignite services on each Ignite node), and one of the streamer nodes dies. Does the data source client, on catching the exception, have to check the cache to see how much of the batch has been flushed to the cache and resend only the rest? Would setting the streamer's overwrite option to true work, if the data source client resends the entire batch?
A related question regarding the streamer with the overwrite option set to true: how does the streamer compare the data in hand with the data in the cache, if each record is assigned a UUID when it is inserted into the cache?


Saikat Maitra

Re: Streamer and data loss

Hi,

AFAIK, the DataStreamer checks for the presence of the key, and if it is already present in the cache it does not overwrite the value when allowOverwrite is set to false.

Regards,
Saikat
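
[Editor's note: a hedged illustration of the behaviour described above; the cache name is a placeholder. With the default allowOverwrite=false, entries whose keys already exist are skipped, so replaying a batch only deduplicates if the replay re-uses the same keys — randomly generated UUID keys would produce duplicates either way.]

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;

public class OverwriteExample {
    // Replay a single entry with overwrite enabled; with the default
    // (allowOverwrite = false) an existing key would be left untouched.
    static void replay(Ignite ignite) {
        ignite.getOrCreateCache("myCache");
        try (IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("myCache")) {
            streamer.allowOverwrite(true);               // default is false
            streamer.addData("key-1", "replayed-value"); // overwrites any existing value
        }
    }
}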

narges saleh

Re: Streamer and data loss

Thanks Saikat for the feedback.

But if I set the overwrite option to true to avoid duplicates when I have to resend the entire payload after a streamer node failure, then I won't get optimal performance, right?
What's the best practice for dealing with data streamer node failures? Are there examples?

Saikat Maitra

Re: Streamer and data loss

Hi,

To minimise data loss during a streamer node failure, I think we can use the following steps:

1. Use the autoFlushFrequency parameter to set the desired flush frequency. Depending on the desired consistency level and performance, you can choose how frequently the data is flushed to the Ignite nodes.

2. Develop an automated checkpointing process to capture and store the source data offset. It could be something like a Kafka message offset, cache keys (if the keys are sequential), or the timestamp of the last flush; based on that, the Ignite client can restart the data streaming process from the last checkpoint if there is a node failure. (A rough sketch of both steps follows below.)

HTH

Regards,
Saikat
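
[Editor's note: a rough sketch of the two steps above, under assumptions not in the thread — the source is a list of records keyed by a monotonically increasing offset, and the checkpoint store is a plain Ignite cache named "checkpoints" (both hypothetical). On restart, streaming resumes from the saved offset.]

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;

public class CheckpointedLoad {
    static void load(Ignite ignite, List<String> records) {
        IgniteCache<String, Long> checkpoints = ignite.getOrCreateCache("checkpoints");
        Long saved = checkpoints.get("source-1");
        long startOffset = saved == null ? 0L : saved;

        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("myCache")) {
            streamer.autoFlushFrequency(1_000); // step 1: flush buffered entries every second

            for (long offset = startOffset; offset < records.size(); offset++) {
                streamer.addData(offset, records.get((int) offset));

                if (offset % 10_000 == 0) {
                    streamer.flush();                    // force data into the cache...
                    checkpoints.put("source-1", offset); // ...then record progress (step 2)
                }
            }
        }
    }
}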

narges saleh

Re: Streamer and data loss

Thanks Saikat.

I am not sure sequential keys/timestamps and Kafka-like offsets would help if there are many data source clients and many streamer nodes in play; depending on the checkpoint, we might still end up with duplicates (unless you're saying each client sequences its payload before sending it to the streamer; even then, duplicates are possible in the cache). The only sure way, it seems to me, is for the client that catches the exception to check the cache and resend only the diff, which makes things very complex. The other approach, if I am right, is to enable overwrite, so the streamer would dedup the data in the cache. The latter is costly too. The ideal approach, I think, would be some type of streamer resiliency where another streamer node could pick up the buffer from a crashed streamer and continue the work.


ilya.kasnacheev

Re: Streamer and data loss

Hello!

I think you should consider using the putAll() operation if resiliency is important to you, since this operation will be salvaged if the initiator node fails.

Regards,
--
Ilya Kasnacheev
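
[Editor's note: a sketch of this suggestion; the cache name, batch shape and retry count are assumptions. Since a putAll() retry simply overwrites whatever part of the batch already landed, the whole batch can be resent on failure without a diff check.]

import java.util.Map;
import java.util.TreeMap;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

public class PutAllLoad {
    // Write one batch with putAll(), retrying the whole batch on failure.
    static void loadBatch(Ignite ignite, Map<Long, String> batch) {
        IgniteCache<Long, String> cache = ignite.getOrCreateCache("myCache");
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                // Sorted keys reduce the chance of deadlocks when several
                // clients write overlapping key ranges concurrently.
                cache.putAll(new TreeMap<>(batch));
                return;
            } catch (Exception e) {
                System.err.println("putAll failed, retrying: " + e.getMessage());
            }
        }
        throw new IllegalStateException("Batch could not be persisted.");
    }
}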


narges saleh

Re: Streamer and data loss

Hello Ilya,

If I use the putAll() operation, then I won't get the streamer's bulk performance, will I? I have a huge amount of data to persist.

thanks.

ilya.kasnacheev

Re: Streamer and data loss

Hello!

If you use it in a smart way, you can get very close performance (to an allowOverwrite=true data streamer), I guess.

Just call it with a decent number of entries belonging to the same cache partition, from multiple threads, with non-intersecting keys of course.

Regards,
--
Ilya Kasnacheev
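
[Editor's note: a rough sketch of that pattern, assuming an already-started Ignite instance and a placeholder cache name. Entries are bucketed by the partition that owns each key, and each bucket is written with its own putAll() from a thread pool, so every call touches a single partition and the key sets never intersect.]

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

public class PartitionedPutAll {
    static void load(Ignite ignite, Map<Long, String> data) throws InterruptedException {
        IgniteCache<Long, String> cache = ignite.getOrCreateCache("myCache");

        // Bucket entries by the cache partition that owns each key.
        Map<Integer, Map<Long, String>> byPartition = new HashMap<>();
        for (Map.Entry<Long, String> e : data.entrySet()) {
            int part = ignite.affinity("myCache").partition(e.getKey());
            byPartition.computeIfAbsent(part, p -> new HashMap<>()).put(e.getKey(), e.getValue());
        }

        // One putAll() per partition, issued from multiple threads with
        // non-intersecting keys.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (Map<Long, String> bucket : byPartition.values())
            pool.submit(() -> cache.putAll(bucket));

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}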

