Ignite in-memory + other SQL store without fully loading all data into Ignite

Courtney Robinson

Ignite in-memory + other SQL store without fully loading all data into Ignite

We've been using Ignite in production for almost 3 years and we love the platform, but there are some increasingly frustrating points we keep running into. Before the holidays a few of our engineers started looking around and have now presented a serious case for migrating away from Ignite. We would end up using at least 3 technologies they've identified to bridge the gaps left by Ignite's features, but they have made a good case that managing these individually would be a more flexible solution we could grow with.

I am not keen on the idea, as it means a major refactor that would likely take 6 months to get to production, but I understand and agree with the points they've made. I'm trying to find a middle ground, as this seems like the nuclear option to me.

Off the top of my head, some of the things they've raised are:
  1. Lack of tooling
    1. Inability to change a column type, plus no support in any schema migration tools we've found (even after deleting a column we can't reuse its name).
    2. We had to build our own backup solution, and even now that backup has landed in 2.9 we can't use it directly, because we implemented relatively granular backups so we can push to S3-compatible APIs (Ceph in our case) and restore partially, down to per-hour granularity. Whilst we've done it, it took serious engineering effort and time. We considered open-sourcing it, but it's tightly coupled to our internal stack and APIs.
  2. Inconsistency between the various Ignite APIs.
    1. Transactions on KV, none on SQL (or now in beta, but work seems to have paused?).
    2. SQL limitations - SELECT queries never read through data from the external database.
    3. Even if we implement a CacheStore, we have to load all data into Ignite to use SELECT.
    4. No referential integrity enforcement.
  3. It is incredibly easy to corrupt the data in Ignite persistence. We've gotten better through operational experience, but upgrades (in k8s) still occasionally leave one or two nodes corrupt when their pod is stopped.
I'll stop there, but the point is that after 3 years in production the team feels like they're always running up against a wall, and that Ignite has created that wall.

My goal in writing this is to find out more about why the limitation around CacheStore exists: how does Ignite persistence manage to keep only part of the data cached in memory, pulling from disk when the data is not in memory, and why can't the same approach apply to a CacheStore?

What would it take to make it so that Ignite's SQL operations could be pushed down to a CacheStore implementation?
Ignite is a relatively large code base, so hints about which classes/interfaces to investigate if we're looking to replace Ignite persistence would be incredibly useful. My idea at the moment is to have Ignite as the in-memory SQL layer with a SQL MPP providing persistence.

To me, right now, the path forward is for us to put in the work to remove these Ignite limitations if possible. We have a mixture of on-premise clients for our product as well as a multi-tenant SaaS version - some of these on-prem clients depend on Ignite's in-memory capabilities, so we can't easily take this away.

FYI, it doesn't have to be CacheStore specifically - I realise it just inherits from the JCache interfaces. More generally: can something like CacheStore be implemented to replace the integration that Ignite persistence provides?
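For reference, the kind of store integration I'm talking about is the plain JCache-style read/write-through store below - a minimal sketch, with the cache name and store contents purely illustrative. KV gets go through load(), but SQL never does, which is the limitation above.

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.lang.IgniteBiInClosure;

public class ExternalStoreSketch {
    /** Illustrative store backed by an external SQL database (JDBC details omitted). */
    public static class PersonStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) {
            // Read-through: only invoked for KV cache.get() misses, never for SQL SELECTs.
            return null; // e.g. SELECT name FROM person WHERE id = ?
        }

        @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) {
            // Write-through to the external database.
        }

        @Override public void delete(Object key) {
            // Delete-through.
        }

        @Override public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
            // Bulk pre-load: to make SQL usable we'd have to load *everything* here.
        }
    }

    public static void main(String[] args) {
        CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("person");
        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PersonStore.class));
        cfg.setReadThrough(true);
        cfg.setWriteThrough(true);

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, String> cache = ignite.getOrCreateCache(cfg);
            cache.get(1L); // miss -> PersonStore.load(); SQL queries never take this path
        }
    }
}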

Regards,
Courtney Robinson
Founder and CEO, Hypi

https://hypi.io
Courtney Robinson

Re: Ignite in-memory + other SQL store without fully loading all data into Ignite

I know this was over the holidays, so bumping. Can anyone provide any pointers on where to start looking, or on anything else mentioned in the previous email?
Thanks

Valentin Kulichenko (vkulichenko)

Re: Ignite in-memory + other SQL store without fully loading all data into Ignite

Hi Courtney,

Thanks for your feedback!

To cut a long story short, Ignite implements a page-memory architecture. All data is split into fixed-size pages, and any page can reside either both in memory and on disk, or on disk only. Since the in-memory layer and the persistence layer are natively integrated - i.e. there is a unified mechanism for page lookup - the Ignite storage engine always knows where a particular page is and can transparently load it into memory if needed. From this perspective, Ignite is actually very similar to how other databases work, although it has a much bigger focus on scalability and in-memory processing, richer compute capabilities, etc.
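For illustration, that page-memory layer is what the data region configuration controls - a minimal sketch, with the region name and size arbitrary:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NativePersistenceSketch {
    public static void main(String[] args) {
        // One data region: pages live in memory up to maxSize; with persistence
        // enabled, pages evicted from memory are transparently read back from disk.
        DataRegionConfiguration region = new DataRegionConfiguration();
        region.setName("default");
        region.setPersistenceEnabled(true);
        region.setMaxSize(4L * 1024 * 1024 * 1024); // 4 GB in-memory layer

        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setDefaultDataRegionConfiguration(region);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storage);

        Ignite ignite = Ignition.start(cfg);

        // With persistence enabled, the cluster must be activated before first use.
        ignite.cluster().active(true);
    }
}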

With an external database it's much more complicated - generally speaking, there is no way of knowing which data needs to be loaded from the database into memory for the purposes of an arbitrary query. This is a fundamental limitation and, quite frankly, I don't think there is any other technology that can do that - not in a consistent and performant way, at least. As a matter of fact, this limitation is exactly what used to be the primary driver behind the Native Persistence development in the first place.

Is there any particular reason why you can't use Native Persistence instead of an external database? It sounds like that's what you need for your use case, unless I'm missing something. Can you tell us a little bit more about your use case and requirements? I'm sure we can come up with a solution that satisfies them without the need to re-implement the whole thing.

Thanks,
Val

Courtney Robinson

Re: Ignite in-memory + other SQL store without fully loading all data into Ignite

Hi Val,
Thanks. You're not missing anything - we have been using Ignite persistence for years.
Among the reasons we want to switch to in-memory only is how easily we seem to get corrupt nodes, as I mentioned in the first email. We haven't had a situation where upgrading corrupts more than one of our three replicas, but we genuinely fear upgrading as a result, which is why we put so much effort into our backup/restore solution.

One of the arguments for disabling persistence is that other DBs seem to have had longer to solve this problem. We've operated Postgres for longer than Ignite persistence and have never had it corrupt its data on disk (admittedly it's not distributed, so that could play a role), and we've also been running Cassandra for longer without ever having any data corruption. All three run in Kubernetes. With Ignite we've tried many things, and a combination of settings between Ignite's rebalanceDelay (I forget which other ones) and the k8s readiness/liveness probes seems to have landed in a sweet spot that has reduced how often it happens, but if we have any issues with the k8s control plane, scheduling delays or the network, the chances of it skyrocket. It then requires manual intervention. I believe the most frequent issue is that a node starts after the topology has diverged, so the cluster doesn't think it's a part of it; the most effective fix we've found is deleting the data on that node and having it join as a new node so that rebalancing kicks in and pushes data back to it. I think we have a few tickets somewhere with details.

This is the primary motivator for wanting to replace Ignite persistence.

Here's our use case:
We provide a low-code app development platform. As a developer you define tables in a visual interface or via GraphQL, and we map these onto Ignite tables. It is a multi-tenant system complete with authentication/authorisation, serverless functions, an API gateway and more - all the bells and whistles for building modern applications in record time. We make it easy and convenient to get the scale and performance you'd expect from a large company with as little as the effort of an intern.

The thing about convenience is that the expectations are set by others (the industry in this case), so the convenient things you provide need to meet a baseline that matches whoever has set those expectations.
It is a sizeable task to implement/enforce referential integrity and cascading deletes application-side, and I don't know of any distributed database which has this built in, but the lack of transactions in SQL means we can't really implement it without pretty much re-implementing what the DB should be doing to keep track and roll back on failure.
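To be concrete, what we'd have to do application-side is roughly the following over the KV API - a minimal sketch with invented cache names, and only possible because KV operations (unlike SQL) can run in a transaction:

import java.util.Set;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class CascadeDeleteSketch {
    /** Deletes a parent row and its child rows atomically via the KV API.
     *  Both caches are assumed to use CacheAtomicityMode.TRANSACTIONAL. */
    static void deleteCascade(Ignite ignite, Long parentId, Set<Long> childIds) {
        IgniteCache<Long, Object> parents = ignite.cache("parent");
        IgniteCache<Long, Object> children = ignite.cache("child");

        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ)) {
            children.removeAll(childIds); // "ON DELETE CASCADE" done by hand
            parents.remove(parentId);
            tx.commit(); // rolled back automatically if commit() is never reached
        }
    }
}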

In an earlier version of the platform we implemented this using the key-value APIs and weren't using SQL at all. We moved to SQL about 1.5 years ago because we needed greater and greater query capability, and we were implementing a custom query layer on the KV APIs that was starting to look remarkably close to what SQL could do.

In our use case, there are only 4 tables known at development time. Every app in the platform defines its tables dynamically and this has an impact on what Ignite capabilities are realistic to use. Including our own UI that lets users create other apps, everything is user defined.

One issue that our customers are constantly complaining about is the inability to change a column's type.
We tried generating Java classes, compiling them and including the Ignite annotations, but found that the JVM ends up doing a lot more GC, leading to crazy long GC pauses. Generating classes meant we could re-map the same field to a different field in the generated class and silently migrate the data for known types (int to float, etc.) or with a user-provided function to do the conversion. It proved impractical, as we'd sometimes get JVMs that were unusable for long periods, and if this happened on the 3 nodes that happen to hold all 3 replicas of your data, your entire app's response time goes from 50 milliseconds to several seconds or however long the pause lasts (tens of seconds sometimes).

This got worse as the JVM pauses would lead to Ignite map failures (I forget the exact error, but it was something about failing to execute the map step of a SQL query) on unrelated queries, causing a cascade of failures across tenants.
We're investigating a new approach: storing mappings from user fields to different table names so we'd get back the ability to change the type of a field and migrate old data. We're not sure of the overhead this will introduce yet; it's still being investigated. Your thoughts here are appreciated.
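For context, these dynamic tables are defined with QueryEntity rather than annotated classes, so the remapping would happen at that level - a minimal sketch with invented field names:

import java.util.Collections;
import java.util.LinkedHashMap;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.QueryEntity;
import org.apache.ignite.configuration.CacheConfiguration;

public class DynamicTableSketch {
    public static void main(String[] args) {
        // Columns are only known at runtime, so they're declared as a map
        // instead of as annotations on a generated class.
        LinkedHashMap<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "java.lang.Long");
        fields.put("score", "java.lang.Integer"); // a "type change" would map the user
                                                  // field to a new column, e.g. score_v2

        QueryEntity entity = new QueryEntity("java.lang.Long", "AppRecord");
        entity.setTableName("app_record");
        entity.setFields(fields);
        entity.setKeyFieldName("id");

        CacheConfiguration<Long, Object> cfg = new CacheConfiguration<>("app_record");
        cfg.setQueryEntities(Collections.singletonList(entity));

        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache(cfg); // values are stored as binary objects
        }
    }
}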

Next is extending the available SQL dialect. We include indexing capabilities for our customers: before migrating to the SQL APIs, we integrated Apache Lucene (we have a filter language that mapped to Lucene queries). Indices were replicated but not partitioned, and we used affinity compute to send queries to the right nodes, which would send back a list of IDs we could then fetch using the KV APIs. This worked well and provided superior filtering capabilities. Migrating to the SQL APIs meant we lost this, as we found no way to extend SQL with a custom function that'd be resolved to a custom affinity run and, importantly, no way to use the results of that function to reduce/filter the data coming back from SQL queries. Right now we're hacking around it with some SQL operators, but the search quality doesn't even compare to what we could do with the Lucene + KV APIs. This is something we're actively investigating, as the current approach is showing its limits as we grow. We've also got on-premise customers set to make this unusable in Q1, as their low-code app data volumes begin to exceed what the query hacks can reasonably filter within an acceptable time.

Right now we're looking at two options: the first is our old approach of an affinity run over Lucene, feeding the results into an IN clause in a subsequent query (see the sketch below); the second is forking Ignite to extend the SQL dialect and integrating the Lucene results into Ignite's map/filter phase. The MVP here is doing this with all supported Lucene queries over a Lucene index of a 1 TB table.
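A rough sketch of the first option, with the node-local Lucene lookup stubbed out and cache/field names invented (empty-result handling and parameterisation omitted):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.lang.IgniteCallable;

public class LuceneThenSqlSketch {
    /** Runs a node-local Lucene search on the node owning the affinity key,
     *  then reduces the SQL query with the matching IDs. */
    static List<List<?>> search(Ignite ignite, Object affinityKey, String luceneQuery) {
        // Step 1: affinity call - runs on the node that owns affinityKey,
        // where the (replicated) Lucene index lives.
        List<Long> ids = ignite.compute().affinityCall(
            "app_record",
            affinityKey,
            (IgniteCallable<List<Long>>) () -> luceneSearch(luceneQuery));

        // Step 2: plain SQL over the in-memory table, filtered by the Lucene hits.
        String in = ids.stream().map(String::valueOf).collect(Collectors.joining(","));
        SqlFieldsQuery qry = new SqlFieldsQuery(
            "SELECT id, score FROM app_record WHERE id IN (" + in + ")");

        IgniteCache<Long, Object> cache = ignite.cache("app_record");
        return cache.query(qry).getAll();
    }

    /** Stub: our filter language -> Lucene query -> matching record IDs. */
    static List<Long> luceneSearch(String luceneQuery) {
        throw new UnsupportedOperationException("node-local Lucene search goes here");
    }
}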

This last point is more of a preference than anything. We're working to bring the ease and flexibility of low-code to machine learning. Ignite's ML capabilities are pretty broad; my issue is that they're completely Java-based. It would be amazing if Ignite had taken the SQL approach or had SQL integration mapped to the Java APIs (I'm thinking of Apache MADlib).

I guess one of the frustrations is that we're spending a lot of time at the DB level instead of at the application level adding useful features for customers. As a small startup we don't really have the bandwidth, so it's a constant battle between product features and infrastructure work.

We are not expecting anything to happen immediately with our Ignite setup; it'll stay for a while, but how we use it may change drastically over time. We can't easily abandon the in-memory capabilities either. One suggestion being considered (to have in place by end of 2021) is using Apache Calcite as the SQL layer with an Ignite driver (mapping to in-memory Ignite). In this setup we'd use Calcite to push down writes to Ignite and Greenplum (MPP Postgres). We'd also push down queries to Ignite in our driver and, where they didn't match, push them to Greenplum instead. This would give us the ability to drop in custom SQL functions that map to Lucene, or indeed Elasticsearch, and has the added benefit of getting things like SQL window functions from Calcite. We'd effectively be using Ignite in-memory to accelerate Greenplum and ES queries behind a consistent SQL API, whilst getting the benefits of the well-established, battle-tested Postgres tools and ecosystem...and SQL-based ML. I'd appreciate your thoughts on the suggested Ignite use here: almost everything can be pushed down to Ignite from Calcite, and anything Ignite doesn't support Calcite can emulate.
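To make the Calcite idea more concrete: the rough shape is registering Ignite and Greenplum as two schemas behind a single Calcite connection and letting Calcite plan across them, pushing work down where it can. A minimal sketch using Calcite's generic JDBC adapter and Ignite's thin JDBC driver - connection strings, credentials and table names are all invented:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import javax.sql.DataSource;
import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class CalciteFederationSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("caseSensitive", "false"); // relax identifier matching across backends

        Connection conn = DriverManager.getConnection("jdbc:calcite:", props);
        CalciteConnection calcite = conn.unwrap(CalciteConnection.class);
        SchemaPlus root = calcite.getRootSchema();

        // Ignite via the thin JDBC driver - the in-memory / acceleration layer.
        DataSource igniteDs = JdbcSchema.dataSource(
            "jdbc:ignite:thin://127.0.0.1", "org.apache.ignite.IgniteJdbcThinDriver", null, null);
        root.add("IGNITE", JdbcSchema.create(root, "IGNITE", igniteDs, null, "PUBLIC"));

        // Greenplum (Postgres protocol) - the MPP persistence layer.
        DataSource gpDs = JdbcSchema.dataSource(
            "jdbc:postgresql://gp-master:5432/app", "org.postgresql.Driver", "app", "secret");
        root.add("GP", JdbcSchema.create(root, "GP", gpDs, null, "public"));

        // Calcite plans the query, pushes what it can down to each backend,
        // and evaluates anything a backend can't do itself.
        String sql = "SELECT i.id, g.total "
            + "FROM ignite.app_record i JOIN gp.order_totals g ON i.id = g.id";
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next())
                System.out.println(rs.getLong(1) + " -> " + rs.getLong(2));
        }
    }
}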

Regards,
Courtney Robinson
Founder and CEO, Hypi

https://hypi.io


Denis Magda (dmagda)

Re: Ignite in-memory + other SQL store without fully loading all data into Ignite

Hi Courtney,

Glad to hear from you! It's been a while since we met last time. It's truly disappointing seeing you struggle with Ignite that much. Thanks for being open and kicking the discussion off.

How about the three of us (you, Val, and I) meet in the second week of January and talk through the issues? Then you can share a summary of the call here with the broader community, if you wish. Between us, I've personally been championing the column-type-change feature for a while, and with the Ignite 3.0 IEP it should be doable.

In the meantime, some items for our agenda and pointers:
  • What does your upgrade procedure look like? Ignite doesn't have a rolling-upgrade feature; thus, to upgrade from version A to B in a *consistent* way one should stop the cluster.
  • Re: the frustration about spending too much time on administration and infrastructure-related activities - are you considering any managed-service options? As a developer, I understand this frustration. I was just lucky enough to stay away from the need to administer Postgres, Oracle, MySQL, or Ignite, because we either had an IT administrator or used a managed-service option. Not selling anything here, just noting that all those administration routines are unavoidable. The thing with Ignite is that its distributed nature makes things more complicated.
  • Ignite ML and SQL: in fact, you can already call Ignite ML models from a SQL query. I need to search for pointers. [hidden email], you can probably share some examples quicker.
  • Calcite and Ignite: in Q2 we're planning to release the Calcite-powered SQL engine in Ignite. That would be a dramatic improvement in our SQL capabilities. It should be possible to enable push-downs much more easily.

-
Denis


Courtney Robinson

Re: Ignite in-memory + other SQL store without fully loading all data into Ignite

Hey Denis,

It's been a while. Hope you've been keeping well!
A meeting in the second week of Jan would be great. The agenda looks good too.

We've been watching the Calcite SQL engine wiki and task list, but I got the impression it wouldn't arrive that soon. It may be worth getting involved there to keep some of the Calcite APIs exposed so we can tailor it, because my ideal design would have Ignite in front of anything we end up doing.

We'll sort out the call details closer to the time. Thanks Denis, speak to you and Val in two weeks.

Regards,
Courtney Robinson
Founder and CEO, Hypi

https://hypi.io

