

Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud


As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.
Kind regards,
Stéphane Thibaud Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud


Hello!
You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.
Regards,
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.
Kind regards,
Stéphane Thibaud Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud


Hello Ilya,
Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?
Kind regards,
Stéphane Thibaud Hello!
You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.
Regards,
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.
Kind regards,
Stéphane Thibaud Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud


Excuse me, I just sent my response, but I see that you actually suggested a new column... in that case this would work, but I think it's a bit unfortunate to have to store random numbers.
Kind regards,
Stéphane Thibaud Hello Ilya,
Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?
Kind regards,
Stéphane Thibaud
Hello!
You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.
Regards,
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.
Kind regards,
Stéphane Thibaud Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud


Hello!
I'm not sure why you think it will not scale. If this field is indexed then taking a random sample is basically one btree walk away.
I guess you will have to store random numbers, if you rely on nonrandom field it might introduce bias.
Regards,
Hello Ilya,
Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?
Kind regards,
Stéphane Thibaud
Hello!
You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.
Regards,
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.
Kind regards,
Stéphane Thibaud Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud


Hello Ilya,
Yes, you are right. I sent my first response too quickly. With an extra column it will work. I will go with the approach you suggest. :)
Kind regards,
Stéphane Thibaud Hello!
I'm not sure why you think it will not scale. If this field is indexed then taking a random sample is basically one btree walk away.
I guess you will have to store random numbers, if you rely on nonrandom field it might introduce bias.
Regards,
Hello Ilya,
Thank you for that suggestion. On a traditional database I know that approach does not scale well, since a random number is first assigned to all rows (it scales linearly with the number of rows if I am not mistaken). Do you think this would be different for Ignite?
Kind regards,
Stéphane Thibaud
Hello!
You can have a random indexed field in your table and do queries like SELECT * FROM table WHERE rand_field < RAND() LIMIT 1; to sample random item.
Regards,
As a small addition: it would really help if Ignite had a hashing function for this, but I only see AES encryption.
Kind regards,
Stéphane Thibaud Hello Ignite users,
I am considering to sample randomly on large amounts of data, but I was wondering what would be the most efficient way for this. Right now, I think I might need clusterbased randomness using a MOD function as described here: https://www.alandix.com/academic/topics/random/samplingSQL.htmlI currently have a UUID column (uuid4), which I think can be used for it, but I might need some bit manipulation to get the nonrandom parts out of the UUID. Do you think this is indeed the most straightforward way to do it?
Kind regards,
Stéphane Thibaud

