Query 3x slower with index

classic Classic list List threaded Threaded
20 messages Options
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Query 3x slower with index

Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene


QueryWithoutIndex (1K) Download Attachment
QueryWithIndex (1K) Download Attachment
Andrew Mashenkov Andrew Mashenkov
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
Regards,
Andrew.
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov

Query1_pKeyIdx (1K) Download Attachment
Query1_categoryIdIdx (1K) Download Attachment
Query2_categoryIdx (1K) Download Attachment
Query2_pKeyIdx (1K) Download Attachment
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see if it helps further.

Regards,
--
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:
Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Thanks Ilya, 

Tried it, no luck. It performs the same as when using category_id index alone (slow).
  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

Cheers,
Eugene

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see if it helps further.

Regards,
--
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:
Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello!

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

Regards.
--
Ilya Kasnacheev


чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:
Thanks Ilya, 

Tried it, no luck. It performs the same as when using category_id index alone (slow).
  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

Cheers,
Eugene

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see if it helps further.

Regards,
--
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:
Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

An easy way to reproduce would be to 

1. Create table
CREATE TABLE GA_DATA (
    customer_id bigint,
    dt timestamp,
    category_id int,
    product_views_app int,
    product_clict_app int,
    product_clict_web int,
    product_clict_web int,
    PRIMARY KEY (customer_id, dt, category_id)
) WITH "template=ga_template, backups=0, affinityKey=customer_id";

2. Create indexes
  • CREATE INDEX ga_customer_id ON GA_Data (customer_id)
  • CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)
  • CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)
  • CREATE INDEX ga_category_id ON GA_Data (category_id)
3. Run Explain on the following queries while trying forcing using different indexes
  • Select count(*) FROM( 
Select customer_id from GA_DATA  use index (ga_category_id)
where category_id in (117930, 175930, 175940,175945,101450) 
group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

  • Select count(*) FROM( 
    Select customer_id from GA_DATA ga use index (ga_pKey)
    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   
    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

Regards.
--
Ilya Kasnacheev


чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:
Thanks Ilya, 

Tried it, no luck. It performs the same as when using category_id index alone (slow).
  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

Cheers,
Eugene

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see if it helps further.

Regards,
--
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:
Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hello!

I guess that using AFFINITY_KEY as index have something to do with the fact that GROUP BY really wants to work per-partition.

I have the following query for you:

1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select customer_id from (Select customer_id, product_views_app, product_clict_app from GA_DATA ga join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id) data group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clict_app) > 1);
PLAN  SELECT
    DATA__Z2.CUSTOMER_ID AS __C0_0,
    SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
    SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
    SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
    /* SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
        /++ function ++/
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
     */
GROUP BY DATA__Z2.CUSTOMER_ID

PLAN  SELECT
    COUNT(*)
FROM (
    SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
) _18__Z3
    /* SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
        /++ PUBLIC."merge_scan" ++/
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
     */

However, I'm not sure it is "optimal" or not since I have no idea if it will perform better or worse on real data. That's why I need a subset of data which will make query execution speed readily visible. Unfortunately, I can't deduce that from query plan alone.

Regards,
--
Ilya Kasnacheev


пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[hidden email]>:
An easy way to reproduce would be to 

1. Create table
CREATE TABLE GA_DATA (
    customer_id bigint,
    dt timestamp,
    category_id int,
    product_views_app int,
    product_clict_app int,
    product_clict_web int,
    product_clict_web int,
    PRIMARY KEY (customer_id, dt, category_id)
) WITH "template=ga_template, backups=0, affinityKey=customer_id";

2. Create indexes
  • CREATE INDEX ga_customer_id ON GA_Data (customer_id)
  • CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)
  • CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)
  • CREATE INDEX ga_category_id ON GA_Data (category_id)
3. Run Explain on the following queries while trying forcing using different indexes
  • Select count(*) FROM( 
Select customer_id from GA_DATA  use index (ga_category_id)
where category_id in (117930, 175930, 175940,175945,101450) 
group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

  • Select count(*) FROM( 
    Select customer_id from GA_DATA ga use index (ga_pKey)
    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   
    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

Regards.
--
Ilya Kasnacheev


чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:
Thanks Ilya, 

Tried it, no luck. It performs the same as when using category_id index alone (slow).
  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

Cheers,
Eugene

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see if it helps further.

Regards,
--
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:
Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Hi Ilya, 

I have tried it, and got the same performance as forcing using category index in my initial benchmark - query is 3x slowers and uses only one thread. 

From my experiments so far it seems like Ignite can either (a) use affinity key and run queries in parallel, (b) use index but run the query on only one thread. 

Has anybody been able to run OLAP like queries in while using an index? 

Cheers,
Eugene

On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I guess that using AFFINITY_KEY as index have something to do with the fact that GROUP BY really wants to work per-partition.

I have the following query for you:

1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select customer_id from (Select customer_id, product_views_app, product_clict_app from GA_DATA ga join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id) data group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clict_app) > 1);
PLAN  SELECT
    DATA__Z2.CUSTOMER_ID AS __C0_0,
    SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
    SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
    SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
    /* SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
        /++ function ++/
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
     */
GROUP BY DATA__Z2.CUSTOMER_ID

PLAN  SELECT
    COUNT(*)
FROM (
    SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
) _18__Z3
    /* SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
        /++ PUBLIC."merge_scan" ++/
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
     */

However, I'm not sure it is "optimal" or not since I have no idea if it will perform better or worse on real data. That's why I need a subset of data which will make query execution speed readily visible. Unfortunately, I can't deduce that from query plan alone.

Regards,
--
Ilya Kasnacheev


пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[hidden email]>:
An easy way to reproduce would be to 

1. Create table
CREATE TABLE GA_DATA (
    customer_id bigint,
    dt timestamp,
    category_id int,
    product_views_app int,
    product_clict_app int,
    product_clict_web int,
    product_clict_web int,
    PRIMARY KEY (customer_id, dt, category_id)
) WITH "template=ga_template, backups=0, affinityKey=customer_id";

2. Create indexes
  • CREATE INDEX ga_customer_id ON GA_Data (customer_id)
  • CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)
  • CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)
  • CREATE INDEX ga_category_id ON GA_Data (category_id)
3. Run Explain on the following queries while trying forcing using different indexes
  • Select count(*) FROM( 
Select customer_id from GA_DATA  use index (ga_category_id)
where category_id in (117930, 175930, 175940,175945,101450) 
group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

  • Select count(*) FROM( 
    Select customer_id from GA_DATA ga use index (ga_pKey)
    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   
    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

Regards.
--
Ilya Kasnacheev


чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:
Thanks Ilya, 

Tried it, no luck. It performs the same as when using category_id index alone (slow).
  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

Cheers,
Eugene

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

> 2) ga_customer_and_category_id: on customer_id and category_id

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

You can also try adding affinity key to this index in various places, see if it helps further.

Regards,
--
Ilya Kasnacheev


ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:
Hi Ilya, 

I created 4 indexs on the table:
1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)
2) ga_customer_and_category_id: on customer_id and category_id
2) ga_customer_id: on customer_id
4) ga_category_id: on category_id


For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */
When using #4 (alone or in combination with any of the other 3)
  1. /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */
  2. The query runs slower.
For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 
When using #4 (alone or in combination with any of the other 3)
  1. /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used
  2. The query is much slower. 

Theoretically the query seems pretty simple
  1. Use affinity key  to make sure the query runs in parallel and there are no shuffles 
  2. Filter rows that match category_id using the category_id index
  3. Used customer_id index for the group_by (not sure if this step makes sense)
But I cannot get it to work.

Cheers,
Eugene




On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

Regards,
--
Ilya Kasnacheev


вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:
Hi Ilya, 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

Cheers,
Eugene

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Why don't you diff the results of those two queries, tell us what the difference is?

Regards,
--
Ilya Kasnacheev


пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:
Hello, 

Just wanted to see if anybody had time to look into this. 

Cheers,
Eugene

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:
Thanks! 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

Query1: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 17s
  • Result: 3105868
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index
  • Using an index on category_id increases the query time 33s
Query2: 
  • Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
  • exec time = 38s
  • Result: 3113921
  • Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index
  • Using an index on category_id doesnt change the run time
Query plans are attached. 

3 questions:
  1. Why is the result differnt for the 2 queries - this is quite concerning. 
  2. Why is the 2nd query taking longer
  3. Why  category_id index doesn't work in case of query 2. 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.


Regards,
--
Ilya Kasnacheev


пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:
Hi

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.
The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 
while second query should process full dataset on map phase before pass it for reducing.

Try to use composite index (customer_id, category_id).

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:
Hello, 

Schema:
  • PUBLIC.GATABLE2.CUSTOMER_ID

    PUBLIC.GATABLE2.DT

    PUBLIC.GATABLE2.CATEGORY_ID

    PUBLIC.GATABLE2.VERTICAL_ID

    PUBLIC.GATABLE2.SERVICE

    PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

    PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

    PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

    PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

    PUBLIC.GATABLE2.PDP_SESSIONS_APP

    PUBLIC.GATABLE2.PDP_SESSIONS_WEB

  • pkey = customer_id,dt
  • affinityKey = customer
Query:
  • select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )
The table has 600M rows. 
At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

The SQL execution plan for both queries is attached. 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

Cheers,
Eugene



--
Best regards,
Andrey V. Mashenkov
Stanislav Lukyanov Stanislav Lukyanov
Reply | Threaded
Open this post in threaded view
|

RE: Query 3x slower with index

Hi,

 

It is a rather lengthy thread and I can’t dive into details right now,

but AFAICS the issue now is making affinity key index to work with a secondary index.

The important things to understand is

  1. Ignite will only use one index per table
  2. In case of a composite index, it will apply the columns one by one
  3. The affinity key index should always go first as the first step is splitting the query by affinity key values

 

So, to use index over the affinity key (customer_id) and a secondary index (category_id) one needs to create an index

like (customer_id, category_id), in that order, with no columns in between.

Note that index (customer_id, dt, category_id) can’t be used instead of it.

On the other hand, (customer_id, category_id, dt) can - the last part of the index will be left unused.

 

Thanks,

Stan

 

From: [hidden email]
Sent: 9 октября 2018 г. 19:40
To: [hidden email]
Subject: Re: Query 3x slower with index

 

Hi Ilya, 

 

I have tried it, and got the same performance as forcing using category index in my initial benchmark - query is 3x slowers and uses only one thread. 

 

From my experiments so far it seems like Ignite can either (a) use affinity key and run queries in parallel, (b) use index but run the query on only one thread. 

 

Has anybody been able to run OLAP like queries in while using an index? 

 

Cheers,

Eugene

 

On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I guess that using AFFINITY_KEY as index have something to do with the fact that GROUP BY really wants to work per-partition.

 

I have the following query for you:

 

1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select customer_id from (Select customer_id, product_views_app, product_clict_app from GA_DATA ga join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id) data group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clict_app) > 1);
PLAN  SELECT
    DATA__Z2.CUSTOMER_ID AS __C0_0,
    SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
    SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
    SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
    /* SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
        /++ function ++/
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
     */
GROUP BY DATA__Z2.CUSTOMER_ID

PLAN  SELECT
    COUNT(*)
FROM (
    SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
) _18__Z3
    /* SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
        /++ PUBLIC."merge_scan" ++/
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
     */

 

However, I'm not sure it is "optimal" or not since I have no idea if it will perform better or worse on real data. That's why I need a subset of data which will make query execution speed readily visible. Unfortunately, I can't deduce that from query plan alone.

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[hidden email]>:

An easy way to reproduce would be to 

 

1. Create table

CREATE TABLE GA_DATA (

    customer_id bigint,

    dt timestamp,

    category_id int,

    product_views_app int,

    product_clict_app int,

    product_clict_web int,

    product_clict_web int,

    PRIMARY KEY (customer_id, dt, category_id)

) WITH "template=ga_template, backups=0, affinityKey=customer_id";

 

2. Create indexes

·         CREATE INDEX ga_customer_id ON GA_Data (customer_id)

·         CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)

·         CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)

·         CREATE INDEX ga_category_id ON GA_Data (category_id)

3. Run Explain on the following queries while trying forcing using different indexes

·         Select count(*) FROM( 

Select customer_id from GA_DATA  use index (ga_category_id)

where category_id in (117930, 175930, 175940,175945,101450) 

group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

 

·         Select count(*) FROM( 

    Select customer_id from GA_DATA ga use index (ga_pKey)

    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   

    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

 

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

 

Regards.

--

Ilya Kasnacheev

 

 

чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:

Thanks Ilya, 

 

Tried it, no luck. It performs the same as when using category_id index alone (slow).

  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

 

Cheers,

Eugene

 

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

> 2) ga_customer_and_category_id: on customer_id and category_id

 

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

 

You can also try adding affinity key to this index in various places, see if it helps further.

 

Regards,

--

Ilya Kasnacheev

 

 

ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

I created 4 indexs on the table:

1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)

2) ga_customer_and_category_id: on customer_id and category_id

2) ga_customer_id: on customer_id

4) ga_category_id: on category_id

 

 

For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */

When using #4 (alone or in combination with any of the other 3)

1.       /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */

2.       The query runs slower.

For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 

When using #4 (alone or in combination with any of the other 3)

1.       /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used

2.       The query is much slower. 

 

Theoretically the query seems pretty simple

1.       Use affinity key  to make sure the query runs in parallel and there are no shuffles 

2.       Filter rows that match category_id using the category_id index

3.       Used customer_id index for the group_by (not sure if this step makes sense)

But I cannot get it to work.

 

Cheers,

Eugene

 

 

 

 

On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

 

Regards,

--

Ilya Kasnacheev

 

 

вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

 

Cheers,

Eugene

 

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Why don't you diff the results of those two queries, tell us what the difference is?

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:

Hello, 

 

Just wanted to see if anybody had time to look into this. 

 

Cheers,

Eugene

 

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:

Thanks! 

 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

 

Query1: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 17s

·         Result: 3105868

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index

·         Using an index on category_id increases the query time 33s

Query2: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 38s

·         Result: 3113921

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index

·         Using an index on category_id doesnt change the run time

Query plans are attached. 

 

3 questions:

1.       Why is the result differnt for the 2 queries - this is quite concerning. 

2.       Why is the 2nd query taking longer

3.       Why  category_id index doesn't work in case of query 2. 

 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.

 

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:

Hi

 

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.

The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 

while second query should process full dataset on map phase before pass it for reducing.

 

Try to use composite index (customer_id, category_id).

 

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

 

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:

Hello, 

 

Schema:

·          

PUBLIC.GATABLE2.CUSTOMER_ID

PUBLIC.GATABLE2.DT

PUBLIC.GATABLE2.CATEGORY_ID

PUBLIC.GATABLE2.VERTICAL_ID

PUBLIC.GATABLE2.SERVICE

PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

PUBLIC.GATABLE2.PDP_SESSIONS_APP

PUBLIC.GATABLE2.PDP_SESSIONS_WEB

·         pkey = customer_id,dt

·         affinityKey = customer

Query:

·         select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

The table has 600M rows. 

At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

 

The SQL execution plan for both queries is attached. 

 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

 

Cheers,

Eugene

 


 

--

Best regards,
Andrey V. Mashenkov

 

eugene miretsky eugene miretsky
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

Thanks! 

Could you please clarfiy "In case of a composite index, it will apply the columns one by one"? 

Igntie (or rather H2?) needs to load the data into heap in order to do the groupBy & aggregations. We were hoping that only data that matches the category filter will be loaded. 
What does one by one mean when: (assuming and index (customer_id, category_id)
  1. The fiilter is on both customer  and category. What data will be loaded into Heap?
  2. The fitler is only on category, and the customer is just used for groupBy. Will Ignite
    1.  load one customer with all the rows, and apply the category filter in heap
    2.  load one customer, but load only the rows that pass the category fitler in heap
    3. load all the events that pass the category filter, and then group them by customer. 
From out benchmarking so far it seems like 1 is happening. 

On Thu, Oct 11, 2018 at 1:28 PM Stanislav Lukyanov <[hidden email]> wrote:

Hi,

 

It is a rather lengthy thread and I can’t dive into details right now,

but AFAICS the issue now is making affinity key index to work with a secondary index.

The important things to understand is

  1. Ignite will only use one index per table
  2. In case of a composite index, it will apply the columns one by one
  3. The affinity key index should always go first as the first step is splitting the query by affinity key values

 

So, to use index over the affinity key (customer_id) and a secondary index (category_id) one needs to create an index

like (customer_id, category_id), in that order, with no columns in between.

Note that index (customer_id, dt, category_id) can’t be used instead of it.

On the other hand, (customer_id, category_id, dt) can - the last part of the index will be left unused.

 

Thanks,

Stan

 

From: [hidden email]
Sent: 9 октября 2018 г. 19:40
To: [hidden email]
Subject: Re: Query 3x slower with index

 

Hi Ilya, 

 

I have tried it, and got the same performance as forcing using category index in my initial benchmark - query is 3x slowers and uses only one thread. 

 

From my experiments so far it seems like Ignite can either (a) use affinity key and run queries in parallel, (b) use index but run the query on only one thread. 

 

Has anybody been able to run OLAP like queries in while using an index? 

 

Cheers,

Eugene

 

On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I guess that using AFFINITY_KEY as index have something to do with the fact that GROUP BY really wants to work per-partition.

 

I have the following query for you:

 

1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select customer_id from (Select customer_id, product_views_app, product_clict_app from GA_DATA ga join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id) data group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clict_app) > 1);
PLAN  SELECT
    DATA__Z2.CUSTOMER_ID AS __C0_0,
    SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
    SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
    SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
    /* SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
        /++ function ++/
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
     */
GROUP BY DATA__Z2.CUSTOMER_ID

PLAN  SELECT
    COUNT(*)
FROM (
    SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
) _18__Z3
    /* SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
        /++ PUBLIC."merge_scan" ++/
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
     */

 

However, I'm not sure it is "optimal" or not since I have no idea if it will perform better or worse on real data. That's why I need a subset of data which will make query execution speed readily visible. Unfortunately, I can't deduce that from query plan alone.

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[hidden email]>:

An easy way to reproduce would be to 

 

1. Create table

CREATE TABLE GA_DATA (

    customer_id bigint,

    dt timestamp,

    category_id int,

    product_views_app int,

    product_clict_app int,

    product_clict_web int,

    product_clict_web int,

    PRIMARY KEY (customer_id, dt, category_id)

) WITH "template=ga_template, backups=0, affinityKey=customer_id";

 

2. Create indexes

·         CREATE INDEX ga_customer_id ON GA_Data (customer_id)

·         CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)

·         CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)

·         CREATE INDEX ga_category_id ON GA_Data (category_id)

3. Run Explain on the following queries while trying forcing using different indexes

·         Select count(*) FROM( 

Select customer_id from GA_DATA  use index (ga_category_id)

where category_id in (117930, 175930, 175940,175945,101450) 

group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

 

·         Select count(*) FROM( 

    Select customer_id from GA_DATA ga use index (ga_pKey)

    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   

    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

 

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

 

Regards.

--

Ilya Kasnacheev

 

 

чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:

Thanks Ilya, 

 

Tried it, no luck. It performs the same as when using category_id index alone (slow).

  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

 

Cheers,

Eugene

 

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

> 2) ga_customer_and_category_id: on customer_id and category_id

 

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

 

You can also try adding affinity key to this index in various places, see if it helps further.

 

Regards,

--

Ilya Kasnacheev

 

 

ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

I created 4 indexs on the table:

1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)

2) ga_customer_and_category_id: on customer_id and category_id

2) ga_customer_id: on customer_id

4) ga_category_id: on category_id

 

 

For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */

When using #4 (alone or in combination with any of the other 3)

1.       /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */

2.       The query runs slower.

For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 

When using #4 (alone or in combination with any of the other 3)

1.       /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used

2.       The query is much slower. 

 

Theoretically the query seems pretty simple

1.       Use affinity key  to make sure the query runs in parallel and there are no shuffles 

2.       Filter rows that match category_id using the category_id index

3.       Used customer_id index for the group_by (not sure if this step makes sense)

But I cannot get it to work.

 

Cheers,

Eugene

 

 

 

 

On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

 

Regards,

--

Ilya Kasnacheev

 

 

вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

 

Cheers,

Eugene

 

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Why don't you diff the results of those two queries, tell us what the difference is?

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:

Hello, 

 

Just wanted to see if anybody had time to look into this. 

 

Cheers,

Eugene

 

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:

Thanks! 

 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

 

Query1: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 17s

·         Result: 3105868

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index

·         Using an index on category_id increases the query time 33s

Query2: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 38s

·         Result: 3113921

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index

·         Using an index on category_id doesnt change the run time

Query plans are attached. 

 

3 questions:

1.       Why is the result differnt for the 2 queries - this is quite concerning. 

2.       Why is the 2nd query taking longer

3.       Why  category_id index doesn't work in case of query 2. 

 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.

 

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:

Hi

 

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.

The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 

while second query should process full dataset on map phase before pass it for reducing.

 

Try to use composite index (customer_id, category_id).

 

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

 

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:

Hello, 

 

Schema:

·          

PUBLIC.GATABLE2.CUSTOMER_ID

PUBLIC.GATABLE2.DT

PUBLIC.GATABLE2.CATEGORY_ID

PUBLIC.GATABLE2.VERTICAL_ID

PUBLIC.GATABLE2.SERVICE

PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

PUBLIC.GATABLE2.PDP_SESSIONS_APP

PUBLIC.GATABLE2.PDP_SESSIONS_WEB

·         pkey = customer_id,dt

·         affinityKey = customer

Query:

·         select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

The table has 600M rows. 

At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

 

The SQL execution plan for both queries is attached. 

 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

 

Cheers,

Eugene

 


 

--

Best regards,
Andrey V. Mashenkov

 

Dave Harvey Dave Harvey
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

In reply to this post by Stanislav Lukyanov
"Ignite will only use one index per table"

I assume you mean "Ignite will only use one index per table per query"?

On Thu, Oct 11, 2018 at 1:55 PM Stanislav Lukyanov <[hidden email]> wrote:

Hi,

 

It is a rather lengthy thread and I can’t dive into details right now,

but AFAICS the issue now is making affinity key index to work with a secondary index.

The important things to understand is

  1. Ignite will only use one index per table
  2. In case of a composite index, it will apply the columns one by one
  3. The affinity key index should always go first as the first step is splitting the query by affinity key values

 

So, to use index over the affinity key (customer_id) and a secondary index (category_id) one needs to create an index

like (customer_id, category_id), in that order, with no columns in between.

Note that index (customer_id, dt, category_id) can’t be used instead of it.

On the other hand, (customer_id, category_id, dt) can - the last part of the index will be left unused.

 

Thanks,

Stan

 

From: [hidden email]
Sent: 9 октября 2018 г. 19:40
To: [hidden email]
Subject: Re: Query 3x slower with index

 

Hi Ilya, 

 

I have tried it, and got the same performance as forcing using category index in my initial benchmark - query is 3x slowers and uses only one thread. 

 

From my experiments so far it seems like Ignite can either (a) use affinity key and run queries in parallel, (b) use index but run the query on only one thread. 

 

Has anybody been able to run OLAP like queries in while using an index? 

 

Cheers,

Eugene

 

On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I guess that using AFFINITY_KEY as index have something to do with the fact that GROUP BY really wants to work per-partition.

 

I have the following query for you:

 

1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select customer_id from (Select customer_id, product_views_app, product_clict_app from GA_DATA ga join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id) data group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clict_app) > 1);
PLAN  SELECT
    DATA__Z2.CUSTOMER_ID AS __C0_0,
    SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
    SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
    SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
    /* SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
        /++ function ++/
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
     */
GROUP BY DATA__Z2.CUSTOMER_ID

PLAN  SELECT
    COUNT(*)
FROM (
    SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
) _18__Z3
    /* SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
        /++ PUBLIC."merge_scan" ++/
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
     */

 

However, I'm not sure it is "optimal" or not since I have no idea if it will perform better or worse on real data. That's why I need a subset of data which will make query execution speed readily visible. Unfortunately, I can't deduce that from query plan alone.

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[hidden email]>:

An easy way to reproduce would be to 

 

1. Create table

CREATE TABLE GA_DATA (

    customer_id bigint,

    dt timestamp,

    category_id int,

    product_views_app int,

    product_clict_app int,

    product_clict_web int,

    product_clict_web int,

    PRIMARY KEY (customer_id, dt, category_id)

) WITH "template=ga_template, backups=0, affinityKey=customer_id";

 

2. Create indexes

·         CREATE INDEX ga_customer_id ON GA_Data (customer_id)

·         CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)

·         CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)

·         CREATE INDEX ga_category_id ON GA_Data (category_id)

3. Run Explain on the following queries while trying forcing using different indexes

·         Select count(*) FROM( 

Select customer_id from GA_DATA  use index (ga_category_id)

where category_id in (117930, 175930, 175940,175945,101450) 

group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

 

·         Select count(*) FROM( 

    Select customer_id from GA_DATA ga use index (ga_pKey)

    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   

    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

 

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

 

Regards.

--

Ilya Kasnacheev

 

 

чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:

Thanks Ilya, 

 

Tried it, no luck. It performs the same as when using category_id index alone (slow).

  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

 

Cheers,

Eugene

 

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

> 2) ga_customer_and_category_id: on customer_id and category_id

 

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

 

You can also try adding affinity key to this index in various places, see if it helps further.

 

Regards,

--

Ilya Kasnacheev

 

 

ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

I created 4 indexs on the table:

1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)

2) ga_customer_and_category_id: on customer_id and category_id

2) ga_customer_id: on customer_id

4) ga_category_id: on category_id

 

 

For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */

When using #4 (alone or in combination with any of the other 3)

1.       /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */

2.       The query runs slower.

For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 

When using #4 (alone or in combination with any of the other 3)

1.       /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used

2.       The query is much slower. 

 

Theoretically the query seems pretty simple

1.       Use affinity key  to make sure the query runs in parallel and there are no shuffles 

2.       Filter rows that match category_id using the category_id index

3.       Used customer_id index for the group_by (not sure if this step makes sense)

But I cannot get it to work.

 

Cheers,

Eugene

 

 

 

 

On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

 

Regards,

--

Ilya Kasnacheev

 

 

вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

 

Cheers,

Eugene

 

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Why don't you diff the results of those two queries, tell us what the difference is?

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:

Hello, 

 

Just wanted to see if anybody had time to look into this. 

 

Cheers,

Eugene

 

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:

Thanks! 

 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

 

Query1: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 17s

·         Result: 3105868

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index

·         Using an index on category_id increases the query time 33s

Query2: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 38s

·         Result: 3113921

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index

·         Using an index on category_id doesnt change the run time

Query plans are attached. 

 

3 questions:

1.       Why is the result differnt for the 2 queries - this is quite concerning. 

2.       Why is the 2nd query taking longer

3.       Why  category_id index doesn't work in case of query 2. 

 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.

 

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:

Hi

 

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.

The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 

while second query should process full dataset on map phase before pass it for reducing.

 

Try to use composite index (customer_id, category_id).

 

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

 

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:

Hello, 

 

Schema:

·          

PUBLIC.GATABLE2.CUSTOMER_ID

PUBLIC.GATABLE2.DT

PUBLIC.GATABLE2.CATEGORY_ID

PUBLIC.GATABLE2.VERTICAL_ID

PUBLIC.GATABLE2.SERVICE

PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

PUBLIC.GATABLE2.PDP_SESSIONS_APP

PUBLIC.GATABLE2.PDP_SESSIONS_WEB

·         pkey = customer_id,dt

·         affinityKey = customer

Query:

·         select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

The table has 600M rows. 

At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

 

The SQL execution plan for both queries is attached. 

 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

 

Cheers,

Eugene

 


 

--

Best regards,
Andrey V. Mashenkov

 



Disclaimer

The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been automatically archived by Mimecast Ltd, an innovator in Software as a Service (SaaS) for business. Providing a safer and more useful place for your human generated data. Specializing in; Security, archiving and compliance. To find out more Click Here.

wt wt
Reply | Threaded
Open this post in threaded view
|

Re: Query 3x slower with index

having worked with databases for 20 years i can see your indexes are not
fully scoped. i see this as your issue

having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1

add those columns to the composite index so that it doesn't need to access
the underlying table and can just use the index.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Stanislav Lukyanov Stanislav Lukyanov
Reply | Threaded
Open this post in threaded view
|

RE: Query 3x slower with index

In reply to this post by Dave Harvey

Yes, sure.

 

From: [hidden email]
Sent: 11 октября 2018 г. 23:59
To: [hidden email]
Subject: Re: Query 3x slower with index

 

"Ignite will only use one index per table"

 

I assume you mean "Ignite will only use one index per table per query"?

 

On Thu, Oct 11, 2018 at 1:55 PM Stanislav Lukyanov <[hidden email]> wrote:

Hi,

 

It is a rather lengthy thread and I can’t dive into details right now,

but AFAICS the issue now is making affinity key index to work with a secondary index.

The important things to understand is

  1. Ignite will only use one index per table
  2. In case of a composite index, it will apply the columns one by one
  3. The affinity key index should always go first as the first step is splitting the query by affinity key values

 

So, to use index over the affinity key (customer_id) and a secondary index (category_id) one needs to create an index

like (customer_id, category_id), in that order, with no columns in between.

Note that index (customer_id, dt, category_id) can’t be used instead of it.

On the other hand, (customer_id, category_id, dt) can - the last part of the index will be left unused.

 

Thanks,

Stan

 

From: [hidden email]
Sent: 9 октября 2018 г. 19:40
To: [hidden email]
Subject: Re: Query 3x slower with index

 

Hi Ilya, 

 

I have tried it, and got the same performance as forcing using category index in my initial benchmark - query is 3x slowers and uses only one thread. 

 

From my experiments so far it seems like Ignite can either (a) use affinity key and run queries in parallel, (b) use index but run the query on only one thread. 

 

Has anybody been able to run OLAP like queries in while using an index? 

 

Cheers,

Eugene

 

On Mon, Sep 24, 2018 at 10:55 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I guess that using AFFINITY_KEY as index have something to do with the fact that GROUP BY really wants to work per-partition.

 

I have the following query for you:

 

1: jdbc:ignite:thin://localhost> explain Select count(*) FROM( Select customer_id from (Select customer_id, product_views_app, product_clict_app from GA_DATA ga join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id) data group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clict_app) > 1);
PLAN  SELECT
    DATA__Z2.CUSTOMER_ID AS __C0_0,
    SUM(DATA__Z2.PRODUCT_VIEWS_APP) AS __C0_1,
    SUM(DATA__Z2.PRODUCT_CLICT_APP) AS __C0_2
FROM (
    SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
) DATA__Z2
    /* SELECT
        GA__Z0.CUSTOMER_ID,
        GA__Z0.PRODUCT_VIEWS_APP,
        GA__Z0.PRODUCT_CLICT_APP
    FROM TABLE(CATEGORY_ID INTEGER=(117930, 175930, 175940, 175945, 101450)) CATS__Z1
        /++ function ++/
    INNER JOIN PUBLIC.GA_DATA GA__Z0
        /++ PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID ++/
        ON 1=1
    WHERE CATS__Z1.CATEGORY_ID = GA__Z0.CATEGORY_ID
     */
GROUP BY DATA__Z2.CUSTOMER_ID

PLAN  SELECT
    COUNT(*)
FROM (
    SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
) _18__Z3
    /* SELECT
        __C0_0 AS CUSTOMER_ID
    FROM PUBLIC.__T0
        /++ PUBLIC."merge_scan" ++/
    GROUP BY __C0_0
    HAVING (SUM(__C0_1) > 2)
        OR (SUM(__C0_2) > 1)
     */

 

However, I'm not sure it is "optimal" or not since I have no idea if it will perform better or worse on real data. That's why I need a subset of data which will make query execution speed readily visible. Unfortunately, I can't deduce that from query plan alone.

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 24 сент. 2018 г. в 16:14, eugene miretsky <[hidden email]>:

An easy way to reproduce would be to 

 

1. Create table

CREATE TABLE GA_DATA (

    customer_id bigint,

    dt timestamp,

    category_id int,

    product_views_app int,

    product_clict_app int,

    product_clict_web int,

    product_clict_web int,

    PRIMARY KEY (customer_id, dt, category_id)

) WITH "template=ga_template, backups=0, affinityKey=customer_id";

 

2. Create indexes

·         CREATE INDEX ga_customer_id ON GA_Data (customer_id)

·         CREATE INDEX ga_pKey ON GA_Data (customer_id, dt, category_id)

·         CREATE INDEX ga_category_and_customer_id ON GA_Data (category_id, customer_id)

·         CREATE INDEX ga_category_id ON GA_Data (category_id)

3. Run Explain on the following queries while trying forcing using different indexes

·         Select count(*) FROM( 

Select customer_id from GA_DATA  use index (ga_category_id)

where category_id in (117930, 175930, 175940,175945,101450) 

group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

 

·         Select count(*) FROM( 

    Select customer_id from GA_DATA ga use index (ga_pKey)

    join table(category_id int = ( 117930, 175930, 175940,175945,101450)) cats on cats.category_id = ga.category_id   

    group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 

 

The execution plans will be similar to what I have posted earler. In particular, only on of (a) affinty key index, (b) category_id index will be used.

 

On Fri, Sep 21, 2018 at 8:49 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Can you share a reproducer project which loads (or generates) data for caches and then queries them? I could try and debug it if I had the reproducer.

 

Regards.

--

Ilya Kasnacheev

 

 

чт, 20 сент. 2018 г. в 21:05, eugene miretsky <[hidden email]>:

Thanks Ilya, 

 

Tried it, no luck. It performs the same as when using category_id index alone (slow).

  Any combindation I try either uses AFFINITY_KEY or category index. When it uses category index it runs slowers. 

 

Also, when AFFINITY_KEY key is used, the jobs runs on 32 threads (my query parallelism settings ) when category_id is used, the jobs runs on one thread most of the time (first few seconds it looks like more threads are doing work). 

 

Please help on this. It seems like a very simple use case (using affinity key and another index), either I am doing something extremly silly, or I stumbled on a bug in Ignite that's effecting a lot of people.

 

Cheers,

Eugene

 

On Thu, Sep 20, 2018 at 6:22 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

> 2) ga_customer_and_category_id: on customer_id and category_id

 

Have you tried to do an index on category_id first, customer_id second? Note that Ignite will use only one index when joining two tables and that in your case it should start with category_id.

 

You can also try adding affinity key to this index in various places, see if it helps further.

 

Regards,

--

Ilya Kasnacheev

 

 

ср, 19 сент. 2018 г. в 21:27, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

I created 4 indexs on the table:

1) ga_pKey: on customer_id, dt, category_id (that's our primary key columns)

2) ga_customer_and_category_id: on customer_id and category_id

2) ga_customer_id: on customer_id

4) ga_category_id: on category_id

 

 

For the first query (category in ()), the execution plan when using the first 3 index is exactly the same  - using /* PUBLIC.AFFINITY_KEY */

When using #4 (alone or in combination with any of the other 3)

1.       /* PUBLIC.AFFINITY_KEY */ is replaced with  /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID IN(117930, 175930, 175940, 175945, 101450) */

2.       The query runs slower.

For the second query (join on an inlined table) the behaviour is very similar. Using the first 3 indexes results in the same plan - using  /* PUBLIC.AFFINITY_KEY */ and  /* function: CATEGORY_ID = GA__Z0.CATEGORY_ID */. 

When using #4 (alone or in combination with any of the other 3)

1.       /* function */ and /* PUBLIC.GA_CATEGORY_ID: CATEGORY_ID = CATS__Z1.CATEGORY_ID */ are used

2.       The query is much slower. 

 

Theoretically the query seems pretty simple

1.       Use affinity key  to make sure the query runs in parallel and there are no shuffles 

2.       Filter rows that match category_id using the category_id index

3.       Used customer_id index for the group_by (not sure if this step makes sense)

But I cannot get it to work.

 

Cheers,

Eugene

 

 

 

 

On Tue, Sep 18, 2018 at 10:56 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I can see you try to use _key_PK as index. If your primary key is composite, it won't work properly for you. I recommend creating an explicit (category_id, customer_id) index.

 

Regards,

--

Ilya Kasnacheev

 

 

вт, 18 сент. 2018 г. в 17:47, eugene miretsky <[hidden email]>:

Hi Ilya, 

 

The different query result was my mistake - one of the categoy_ids was duplicate, so in the query that used join, it counted rows for that category twice. My apologies. 

 

However, we are still having an issue with query time, and the index not being applied to category_id. Would appreciate if you could take a look. 

 

Cheers,

Eugene

 

On Mon, Sep 17, 2018 at 9:15 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

Why don't you diff the results of those two queries, tell us what the difference is?

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 17 сент. 2018 г. в 16:08, eugene miretsky <[hidden email]>:

Hello, 

 

Just wanted to see if anybody had time to look into this. 

 

Cheers,

Eugene

 

On Wed, Sep 12, 2018 at 6:29 PM eugene miretsky <[hidden email]> wrote:

Thanks! 

 

Tried joining with an inlined table instead of IN as per the second suggestion, and it didn't quite work. 

 

Query1: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3  use Index( ) where category_id in (9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 17s

·         Result: 3105868

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index

·         Using an index on category_id increases the query time 33s

Query2: 

·         Select COUNT(*) FROM( Select customer_id from GATABLE3 ga  use index (PUBLIC."_key_PK") inner join table(category_id int = (9005, 175930, 175930, 175940,175945,101450, 6453)) cats on cats.category_id = ga.category_id   group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

·         exec time = 38s

·         Result: 3113921

·         Same exec time if using AFFINITY_KEY index or "_key_PK_hash or customer_id index or category_id index

·         Using an index on category_id doesnt change the run time

Query plans are attached. 

 

3 questions:

1.       Why is the result differnt for the 2 queries - this is quite concerning. 

2.       Why is the 2nd query taking longer

3.       Why  category_id index doesn't work in case of query 2. 

 

On Wed, Sep 5, 2018 at 8:31 AM Ilya Kasnacheev <[hidden email]> wrote:

Hello!

 

I don't think that we're able to use index with IN () clauses. Please convert it into OR clauses.

 

 

Regards,

--

Ilya Kasnacheev

 

 

пн, 3 сент. 2018 г. в 12:46, Andrey Mashenkov <[hidden email]>:

Hi

 

Actually, first query uses index on affinity key which looks more efficient than index on category_id column.

The first query can process groups one by one and stream partial results from map phase to reduce phase as it use sorted index lookup, 

while second query should process full dataset on map phase before pass it for reducing.

 

Try to use composite index (customer_id, category_id).

 

Also, SqlQueryFields.setCollocated(true) flag can help Ignite to build more efficient plan when group by on collocated column is used.

 

On Sun, Sep 2, 2018 at 2:02 AM eugene miretsky <[hidden email]> wrote:

Hello, 

 

Schema:

·          

PUBLIC.GATABLE2.CUSTOMER_ID

PUBLIC.GATABLE2.DT

PUBLIC.GATABLE2.CATEGORY_ID

PUBLIC.GATABLE2.VERTICAL_ID

PUBLIC.GATABLE2.SERVICE

PUBLIC.GATABLE2.PRODUCT_VIEWS_APP

PUBLIC.GATABLE2.PRODUCT_CLICKS_APP

PUBLIC.GATABLE2.PRODUCT_VIEWS_WEB

PUBLIC.GATABLE2.PRODUCT_CLICKS_WEB

PUBLIC.GATABLE2.PDP_SESSIONS_APP

PUBLIC.GATABLE2.PDP_SESSIONS_WEB

·         pkey = customer_id,dt

·         affinityKey = customer

Query:

·         select COUNT(*) FROM( Select customer_id from GATABLE2 where category_id in (175925, 101450, 9005, 175930, 175930, 175940,175945,101450, 6453) group by customer_id having SUM(product_views_app) > 2 OR  SUM(product_clicks_app) > 1 )

The table has 600M rows. 

At first, the query took 1m, when we added an index on category_id the query started taking 3m. 

 

The SQL execution plan for both queries is attached. 

 

We are using a single x1.16xlarge insntace with query parallelism set to 32 

 

Cheers,

Eugene

 


 

--

Best regards,
Andrey V. Mashenkov

 

 

Disclaimer

The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been automatically archived by Mimecast Ltd, an innovator in Software as a Service (SaaS) for business. Providing a safer and more useful place for your human generated data. Specializing in; Security, archiving and compliance. To find out more Click Here.