Recommended way of random sampling

classic Classic list List threaded Threaded
7 messages Options
Stéphane Thibaud-2 Stéphane Thibaud-2
Reply | Threaded
Open this post in threaded view
|

Recommended way of random sampling

Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud
Stéphane Thibaud-2 Stéphane Thibaud-2
Reply | Threaded
Open this post in threaded view
|

Re: Recommended way of random sampling

As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <[hidden email]>:
Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Recommended way of random sampling

Hello!

You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <[hidden email]>:
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <[hidden email]>:
Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud
Stéphane Thibaud-2 Stéphane Thibaud-2
Reply | Threaded
Open this post in threaded view
|

Re: Recommended way of random sampling

Hello Ilya,

Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 15:53 Ilya Kasnacheev <[hidden email]>:
Hello!

You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <[hidden email]>:
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <[hidden email]>:
Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud
Stéphane Thibaud-2 Stéphane Thibaud-2
Reply | Threaded
Open this post in threaded view
|

Re: Recommended way of random sampling

Excuse me, I just sent my response, but I see that you actually suggested a new column... in that case this would work, but I think it's a bit unfortunate to have to store random numbers.


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 19:22 Stéphane Thibaud <[hidden email]>:
Hello Ilya,

Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 15:53 Ilya Kasnacheev <[hidden email]>:
Hello!

You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <[hidden email]>:
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <[hidden email]>:
Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Recommended way of random sampling

In reply to this post by Stéphane Thibaud-2
Hello!

I'm not sure why you think it will not scale. If this field is indexed then taking a random sample is basically one b-tree walk away.

I guess you will have to store random numbers, if you rely on non-random field it might introduce bias.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 13:28, Stéphane Thibaud <[hidden email]>:
Hello Ilya,

Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 15:53 Ilya Kasnacheev <[hidden email]>:
Hello!

You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <[hidden email]>:
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <[hidden email]>:
Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud
Stéphane Thibaud-2 Stéphane Thibaud-2
Reply | Threaded
Open this post in threaded view
|

Re: Recommended way of random sampling

Hello Ilya,

Yes, you are right. I sent my first response too quickly. With an extra column it will work.
I will go with the approach you suggest. :-)


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 20:26 Ilya Kasnacheev <[hidden email]>:
Hello!

I'm not sure why you think it will not scale. If this field is indexed then taking a random sample is basically one b-tree walk away.

I guess you will have to store random numbers, if you rely on non-random field it might introduce bias.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 13:28, Stéphane Thibaud <[hidden email]>:
Hello Ilya,

Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?


Kind regards,

Stéphane Thibaud

2019年5月20日(月) 15:53 Ilya Kasnacheev <[hidden email]>:
Hello!

You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.

Regards,
--
Ilya Kasnacheev


пн, 20 мая 2019 г. в 04:50, Stéphane Thibaud <[hidden email]>:
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.


Kind regards,

Stéphane Thibaud

2019年5月19日(日) 20:59 Stéphane Thibaud <[hidden email]>:
Hello Ignite users,

I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need cluster-based randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/sampling-SQL.html

I currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the non-random parts out of the UUID.
Do you think this is indeed the most straightforward way to do it?


Kind regards,

Stéphane Thibaud