Continuously running jobs

szabkel

Continuously running jobs

Note: I used spaces to indent my examples; I hope they display correctly.
 
I am new to Ignite and I would be really happy if you could give me some help. I am working on a distributed web crawler in Java 8 and would like to use Ignite to distribute jobs across the available nodes. A single job would run a request on a URL and parse some data. I would like to behave well as a crawler, so I would like to time the jobs really precisely. That means not requesting the same server (determined by resolving the domain) more often than a dynamically changing limit allows, while continuously deciding when to go back to a URL to keep the data up to date. The software would target specific domains and URLs, and different content needs to be parsed differently (basically it is deep web crawling, or scraping), so I imagine something like this:
Crawler
    Scheduler //which times and broadcasts the jobs
    JobGroups
        SiteXJobGroup //tells how to work on Site X
        SiteYJobGroup //tells how to work on Site Y
        Site...JobGroup

The scheduler would load the information (how often to run the specific jobs) from a database (cron strings?). It runs the jobs in parallel: because one job works with a single domain or group of servers and I don't want to burden them with my traffic, a job advances slowly, but I can run several of them in parallel to some extent. I should be able to extend the application later with new jobs. All information about the process is stored in a database so that it is persistent, and even if everything shuts down, it could continue from where it stopped.
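The per-server limit described above could be sketched roughly as follows. This is only a minimal single-JVM illustration in plain Java, not an Ignite API: the class name `DomainThrottle` and its methods are hypothetical, and in a cluster the state would have to be shared between nodes in some way (which this sketch does not attempt). The key is called "domain" here, but it could equally be the resolved server address.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers the earliest time each domain may be requested again. */
class DomainThrottle {
    private final Map<String, Long> nextAllowedMillis = new ConcurrentHashMap<>();
    private final Map<String, Long> delayMillis = new ConcurrentHashMap<>();
    private final long defaultDelayMillis;

    DomainThrottle(long defaultDelayMillis) {
        this.defaultDelayMillis = defaultDelayMillis;
    }

    /** Changes the delay for one domain at runtime (the "dynamically changing limit"). */
    void setDelay(String domain, long millis) {
        delayMillis.put(domain, millis);
    }

    /**
     * Returns true and reserves the next slot if the domain may be requested
     * at time nowMillis; returns false if it is still too early.
     */
    boolean tryAcquire(String domain, long nowMillis) {
        long delay = delayMillis.getOrDefault(domain, defaultDelayMillis);
        boolean[] granted = {false};
        // compute() is atomic per key, so two threads cannot both win the same slot.
        nextAllowedMillis.compute(domain, (d, next) -> {
            if (next == null || nowMillis >= next) {
                granted[0] = true;
                return nowMillis + delay;
            }
            return next;
        });
        return granted[0];
    }
}
```

A job group would call `tryAcquire(domain, System.currentTimeMillis())` before sending a request and wait or move on to another domain when it returns false.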

My question is: can I do the scheduling solely with Ignite? That includes rerunning jobs that have already completed after some time, and running different jobs at the same time.

A single job (a specific page; this would be broadcast to a node):
job(url argument)
    load the data
    parse the data
    return the data

A group of jobs (a single domain):
while(there is URL in the URLFrontier for the jobs)
    url <- pop url from URLFrontier
    result <- broadcast job(url) //broadcast here is easier, but maybe the Scheduler should
    filter the result after it comes back (do I need it, was it already seen, did it contain an error, etc.)
    decide whether a new URL needs to be added to the URLFrontier based on the result (example: next page)
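
The job group loop above might look roughly like this in plain Java. It is a hedged sketch, not a definitive design: `PageResult` and `JobGroup` are hypothetical names, and the job is passed in as an ordinary `Function` so the loop can be tried without a cluster. With Ignite, the `job.apply(url)` call is the place where the work would be handed to the compute grid instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical result of one job: the parsed data and an optional follow-up URL. */
class PageResult {
    final String url;
    final String data;     // null is treated as "the job failed" in this sketch
    final String nextUrl;  // null when there is no next page

    PageResult(String url, String data, String nextUrl) {
        this.url = url;
        this.data = data;
        this.nextUrl = nextUrl;
    }
}

/** Hypothetical job group: drains a URL frontier for a single domain. */
class JobGroup {
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final Function<String, PageResult> job;

    JobGroup(List<String> seeds, Function<String, PageResult> job) {
        this.job = job;
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
    }

    /** Runs until the frontier is empty and returns the results that passed the filter. */
    List<PageResult> run() {
        List<PageResult> accepted = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = frontier.pop();
            // Assumption: in a cluster this call would be dispatched to a node,
            // for example through Ignite's compute API, rather than run locally.
            PageResult result = job.apply(url);
            if (result == null || result.data == null) {
                continue; // filter: drop failed or empty results
            }
            accepted.add(result);
            // Add the follow-up URL only if it has not been seen before.
            if (result.nextUrl != null && seen.add(result.nextUrl)) {
                frontier.add(result.nextUrl);
            }
        }
        return accepted;
    }
}
```

The `seen` set covers the "already seen" check from the pseudocode; error handling and persistence of the frontier are left out on purpose.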

Would the scheduler run all the time? I don't really know how to set this up. It should refresh/load which jobs need to run (the job groups have an ID or something similar), start a job group and further ones beside it up to a limit, start another one whenever a job group finishes, and start again once everything is done.
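
One way to picture the scheduler behaviour described above is a fixed-size thread pool: the pool size is the limit on job groups running side by side, a waiting group starts as soon as a running one finishes, and each round reloads the list of groups. The sketch below is a plain Java illustration under those assumptions, not an Ignite feature; `GroupScheduler` and its `loader` parameter are hypothetical, and a real scheduler would loop indefinitely instead of for a fixed number of rounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

/** Hypothetical scheduler: runs job groups in rounds with a cap on parallelism. */
class GroupScheduler {
    private final int maxParallel;
    // Assumption: the loader re-reads the job groups (e.g. from a database) each round.
    private final Supplier<List<Callable<String>>> loader;

    GroupScheduler(int maxParallel, Supplier<List<Callable<String>>> loader) {
        this.maxParallel = maxParallel;
        this.loader = loader;
    }

    /** Runs the given number of rounds and returns the IDs of the finished groups. */
    List<String> run(int rounds) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        List<String> finished = new ArrayList<>();
        try {
            for (int round = 0; round < rounds; round++) {
                // invokeAll blocks until every group of this round has completed;
                // at most maxParallel of them execute at the same time.
                List<Future<String>> futures = pool.invokeAll(loader.get());
                for (Future<String> future : futures) {
                    finished.add(future.get());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scheduler interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job group failed", e);
        } finally {
            pool.shutdown();
        }
        return finished;
    }
}
```

Each `Callable` stands for one job group and returns its ID when it is done, so the caller can record which groups completed in which order.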

I think I understand the ComputeContinuousMapperExample in ignite/examples, but I need to run multiple continuous mappers. How do I achieve this? ( https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/computegrid/ComputeContinuousMapperExample.java )

Thank you for your help, really, I appreciate it!
szabkel

Re: Continuously running jobs

What do you think, is Ignite the right tool for me? Am I better off with Akka?
szabkel

Re: Continuously running jobs

Was I not clear enough? Did I ask for something stupid?
dsetrakyan

Re: Continuously running jobs

Hi, 

Sorry, somehow I missed this question.

Sounds like the Ignite compute grid should be perfect for your task. Ignite load balances the jobs within the cluster automatically, with several load balancing policies available, so you should be able to simply unicast or broadcast closures and have them evenly distributed within the cluster. The jobs are simple closures, so you should be able to implement any logic you need.

For more information on the compute grid, scheduling, and load balancing, please refer to Ignite documentation:


D.


On Tue, Mar 8, 2016 at 1:44 PM, KSzabolcs <[hidden email]> wrote:
Was I not clear enough? Did I ask for something stupid?