Info: I used spaces to indent my examples, hope it displayed correctly.
I am new to Ignite and would be really happy if you could give me some help. I am working on a distributed web crawler in Java 8 and would like to use Ignite to distribute jobs across the available nodes. A single job would run a request on a URL and parse some data. I want the crawler to behave well, so I need to time the jobs precisely: no server (after domain resolution) should be requested more often than a dynamically changing limit allows, and the crawler should continuously decide when to go back to a URL to keep the data up to date. The software would target specific domains and URLs, and different content needs to be parsed differently (basically it is deep web crawling and scraping), so I imagine something like this:
Scheduler //which times and broadcasts the jobs
SiteXJobGroup //tells how to work on Site X
SiteYJobGroup //tells how to work on Site Y
The scheduler would load the information (how often to run the specific jobs) from a database (cron strings?) and run the job groups in parallel. Each job group works with a single domain or group of servers and advances slowly so that I don't burden those servers with my traffic, but I can run several groups in parallel to some extent. I should be able to extend the application later with new jobs. All information about the process is stored in a database, so even if everything shuts down, it can continue from where it stopped.
My question is: can I do the scheduling solely with Ignite? That is, rerunning jobs that have already completed after some time, and running different jobs at the same time.
A single job (a specific page; this would get broadcast to a node):
    load the data
    parse the data
    return the data
A group of jobs (a single domain):
    while (there is a URL in the URLFrontier for the jobs)
        url <- pop url from URLFrontier
        result <- broadcast job(url) //broadcast here is easier, but maybe the Scheduler should
        filter the result after it comes back (do I need it, already seen, contained an error, etc.)
        decide whether to add a new URL to the URLFrontier based on the result (example: next page)
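To make the job group loop above concrete, here is a minimal plain-Java sketch of it. It only uses the JDK, not Ignite. The names `JobGroupSketch`, `PageResult`, `runGroup` and `delayMillis` are made up for illustration, a `null` result is assumed to mean "the job failed", and the `job` function stands in for whatever would actually be sent to a node.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of one job group (a single domain); not an Ignite API. */
final class JobGroupSketch {

    /** Outcome of one job: parsed data (null = error) and an optional follow-up URL. */
    static final class PageResult {
        final String data;
        final Optional<String> nextUrl;

        PageResult(String data, Optional<String> nextUrl) {
            this.data = data;
            this.nextUrl = nextUrl;
        }
    }

    /**
     * Drains the URL frontier one URL at a time, pausing delayMillis between
     * requests so that the target domain is not hit too often.
     */
    static List<String> runGroup(List<String> seeds,
                                 Function<String, PageResult> job,
                                 long delayMillis) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> seen = new HashSet<>(seeds);
        List<String> results = new ArrayList<>();

        while (!frontier.isEmpty()) {
            String url = frontier.pop();
            // In a cluster, this call would be handed to a remote node instead.
            PageResult result = job.apply(url);

            // Filter: keep only results that did not end in an error.
            if (result.data != null) {
                results.add(result.data);
            }
            // Add a follow-up URL (e.g. "next page") only if it was not seen before.
            result.nextUrl.filter(seen::add).ifPresent(frontier::add);

            if (!frontier.isEmpty()) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop the group early when interrupted
                }
            }
        }
        return results;
    }
}
```

The `seen` set is what keeps the loop from running forever when pages link back to each other. The fixed delay is the simplest possible politeness rule; a dynamically changing limit would replace the constant with a value looked up per domain.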
Would the Scheduler run all the time? I don't really know how to set this up. It should refresh/load which jobs need to run (the job groups have an ID or something), start job groups side by side up to a limit, start another one whenever a job group finishes, and start over once everything is done.
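The scheduler behaviour described above (reload the job groups, run them side by side up to a limit, start over when all are done) could look roughly like this in plain Java. This is only a sketch with the JDK's `ExecutorService`, not Ignite's scheduling. `SchedulerSketch`, `runRounds`, `loadGroups` and `maxParallel` are hypothetical names, and the loop is bounded by `rounds` here, where a real scheduler would presumably loop until it is shut down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

/** Hypothetical scheduler sketch: bounded parallelism, repeated rounds. */
final class SchedulerSketch {

    /**
     * Runs `rounds` passes. Each pass reloads the job groups (e.g. from a
     * database), runs at most maxParallel of them at the same time, and
     * waits until all of them are done before the next pass starts.
     * Returns the IDs reported by the finished groups, in submission order.
     */
    static List<String> runRounds(Supplier<List<Callable<String>>> loadGroups,
                                  int maxParallel,
                                  int rounds) {
        // The pool size is the limit on concurrently running job groups.
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        List<String> completed = new ArrayList<>();
        try {
            for (int round = 0; round < rounds; round++) {
                List<Callable<String>> groups = loadGroups.get();
                // invokeAll blocks until every group of this round has finished.
                for (Future<String> done : pool.invokeAll(groups)) {
                    completed.add(done.get());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException("job group failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return completed;
    }
}
```

With a fixed-size pool, a waiting job group starts as soon as a running one finishes, which matches the "after one job group is done, start another one" requirement without any extra bookkeeping.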
Sounds like the Ignite compute grid should be perfect for your task. Ignite load-balances jobs within the cluster automatically, with several load-balancing policies available, so you should be able to simply unicast or broadcast closures and have them evenly distributed across the cluster. The jobs are plain closures, so you should be able to implement any logic you need.
For more information on the compute grid, scheduling, and load balancing, please refer to Ignite documentation: