Machine Learning questions

joseheitor

Machine Learning questions

Hi Guys,

A few questions as I progress through my ML learning journey with Ignite...

- I assume that I would start by extracting features from my JSON records in
a cache into a vectorizer - how does this impact memory usage? If the
vectorizer data needs more memory than is available, will the original cache
records be moved to disk? Or will the vectorizer data begin to use swap? Or
will I get OOM exceptions?

- Are there any built-in algorithms or recommended strategies for sampling?

- Are there any dataset statistical functions like those provided by
Python's ML libraries, for high-level evaluation of specific features in a
dataset (to assess things like missing-data, cardinality, min-max, mean,
mode, standard-deviation, percentiles, etc)?

- Is there any doc/video tutorial that would provide a guide for the
complete workflow pipeline for an ML example (encompassing the
abovementioned operations)?

Thanks,
Jose



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
zaleslaw

Re: Machine Learning questions

SPOILER: release 2.8 will be published after New Year, and all of my answers
below relate to that release (the latest update of the ML functionality in
master and in the release branch).

+++ I assume that I would start by extracting features from my JSON records
in a cache into a vectorizer - how does this impact memory usage? +++

The answer is here:
https://apacheignite.readme.io/docs/ml-partition-based-dataset

The cache will be in memory, and the additional data will be located in the
heap too (not in the caches themselves, but near them).
Of course, more memory is required; how much depends on the training
algorithm.

If the heap is small, there is a chance of getting an OOM.
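
To get a rough feeling for the extra heap needed, here is a back-of-the-envelope sketch in plain Java. It is not Ignite API: the class and method names are hypothetical, and it assumes every row is materialised as a dense array of doubles (features plus one label), ignoring object headers and whatever the training algorithm allocates on top.

```java
/** Hypothetical helper, not part of Ignite: rough payload estimate for dense rows. */
final class HeapEstimateSketch {
    /**
     * Raw bytes for `rows` rows of `featuresPerRow` doubles plus one label each.
     * Assumption: dense double storage, 8 bytes per value, no per-object overhead.
     */
    static long denseVectorBytes(long rows, int featuresPerRow) {
        long valuesPerRow = (long) featuresPerRow + 1L; // features + label
        return Math.multiplyExact(rows, Math.multiplyExact(valuesPerRow, (long) Double.BYTES));
    }

    public static void main(String[] args) {
        long estimate = denseVectorBytes(1_000_000L, 9);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Estimated payload: " + estimate + " bytes");
        System.out.println("Max heap of this JVM: " + maxHeap + " bytes");
        if (estimate > maxHeap)
            System.out.println("The payload alone would not fit into this heap.");
    }
}
```

The number is a lower bound only; the real footprint depends on the training algorithm, as noted above.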

+++ Are there any built-in algorithms or recommended strategies for
sampling? +++
Please have a look here
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/tutorial/Step_7_Split_train_test.java

You could use the same mechanism to get a random sample.

But we have no dedicated sampling tool for getting sample rows from a cache.
It is not part of the ML functionality for now.
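
As an illustration of the general idea (keep each row with a fixed probability, the way a random train/test split keeps a fraction of rows), here is a minimal sketch in plain Java. It does not use any Ignite class; `RandomSampleSketch` and `sample` are hypothetical names, and an ordinary `Map` stands in for the cache.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper, not part of Ignite ML: Bernoulli sampling over key/value rows. */
final class RandomSampleSketch {
    /** Keeps each entry independently with probability `fraction` (seeded for repeatability). */
    static <K, V> List<Map.Entry<K, V>> sample(Map<K, V> rows, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0)
            throw new IllegalArgumentException("fraction must be in [0, 1]");

        Random rnd = new Random(seed);
        List<Map.Entry<K, V>> out = new ArrayList<>();

        for (Map.Entry<K, V> e : rows.entrySet()) {
            // nextDouble() is uniform in [0, 1), so this keeps about `fraction` of the rows.
            if (rnd.nextDouble() < fraction)
                out.add(e);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> rows = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++)
            rows.put(i, new double[] {i, i * 2.0});

        List<Map.Entry<Integer, double[]>> sampled = sample(rows, 0.1, 42L);
        System.out.println("Sampled " + sampled.size() + " of " + rows.size() + " rows");
    }
}
```

The sample size is only approximately `fraction * rows.size()`; if an exact size is needed, something like reservoir sampling would be a better fit.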

+++ Are there any dataset statistical functions like those provided by
Python's ML libraries, for high-level evaluation of specific features in a
dataset (to assess things like missing-data, cardinality, min-max, mean,
mode, standard-deviation, percentiles, etc)? +++

We do not manipulate the data in the caches directly; we build new data in a
new format for training purposes. We do not support pandas-like operations
in ML.

We do have preprocessing algorithms, but they are meant to be used as a first
step in a training Pipeline:
https://apacheignite.readme.io/docs/preprocessing

I hope that a summary for the dataset and a few stats (like those described
above) will be added in 2.9.
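
Until something like that exists, a few of these statistics can be computed by hand once a feature column has been pulled out of the cache. The following is a minimal plain-Java sketch, not Ignite API: `FeatureSummarySketch` is a hypothetical name, missing values are assumed to be encoded as `NaN`, and the standard deviation is the population one (divide by n).

```java
/** Hypothetical helper, not part of Ignite ML: one-pass summary of a numeric feature column. */
final class FeatureSummarySketch {
    final int present;   // number of non-missing values
    final int missing;   // number of NaN values (assumed encoding of "missing")
    final double min;
    final double max;
    final double mean;
    final double stdDev; // population standard deviation

    private FeatureSummarySketch(int present, int missing, double min, double max,
                                 double mean, double stdDev) {
        this.present = present;
        this.missing = missing;
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.stdDev = stdDev;
    }

    static FeatureSummarySketch of(double[] column) {
        int present = 0;
        int missing = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations (Welford's method)

        for (double v : column) {
            if (Double.isNaN(v)) {
                missing++;
                continue;
            }
            present++;
            min = Math.min(min, v);
            max = Math.max(max, v);
            double delta = v - mean;
            mean += delta / present;
            m2 += delta * (v - mean);
        }

        if (present == 0) // nothing to summarise: report NaN for every statistic
            return new FeatureSummarySketch(0, missing, Double.NaN, Double.NaN, Double.NaN, Double.NaN);

        return new FeatureSummarySketch(present, missing, min, max, mean, Math.sqrt(m2 / present));
    }

    public static void main(String[] args) {
        FeatureSummarySketch s = of(new double[] {1.0, 2.0, Double.NaN, 3.0});
        System.out.println("present=" + s.present + " missing=" + s.missing
            + " min=" + s.min + " max=" + s.max + " mean=" + s.mean + " stdDev=" + s.stdDev);
    }
}
```

Mode, cardinality and percentiles need the values to be counted or sorted, so they are left out of this one-pass sketch.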

+++ - Is there any doc/video tutorial that would provide a guide for the
complete workflow pipeline for an ML example (encompassing the
abovementioned operations)? +++

First of all, please have a look at the Titanic Tutorial
https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml/tutorial
and the other examples
https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml

Also, a few videos are available on my channel:
https://www.youtube.com/watch?v=3CmnV6IQtTw
https://www.youtube.com/watch?v=DmoMBsiHxf8

Jose, great questions. I hope to share more docs and papers about Ignite ML
after New Year and the 2.8 release.







