Delay queue or similar?

matt

Delay queue or similar?

Hi,

I'm working on prototyping a web crawler using Ignite as the crawl-db. I'd
like to ensure the crawler obeys the appropriate Crawl-delay time as set in
a site's robots.txt file. The way I have this set up now is by submitting
"candidates" to an Ignite cache. A local listener is set up to receive
successfully persisted items and then submits them to a queue for a
fetcher to pull from.

Goal: Support a delay time + maximum fetch concurrency, per-host, per-item.

Put another way: "for each fetch item, ensure that requests made to the
associated host are delayed as required, and no more than n-requests are
made during each delayed run".

This could be modeled as a Map<Host,DelayQueue>, or maybe even by using a
ScheduledExecutorService where each task represents a host and is repeated
according to the delay time.

I'd like to prevent items from being put into the Java work queue if they
are not yet ready to be fetched, and I'm slightly worried about the
potential number of hosts (in reference to the Java Map<Host,...>
data-structure).
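
To make the plain-Java option concrete, here is a minimal sketch of one variation: a single DelayQueue of hosts rather than a DelayQueue per host, so URLs stay out of the work queue until their host's delay has expired. All names (HostScheduler, HostSlot, submit, nextBatch) are hypothetical, the clock is injected only so the behaviour can be checked without sleeping, and nothing here is Ignite API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch: one delayed "slot" per host, plus a plain queue of pending URLs per host.
final class HostScheduler {
    /** Becomes visible to DelayQueue.poll() once readyAt (millis) has passed. */
    private final class HostSlot implements Delayed {
        final String host;
        final long readyAt;

        HostSlot(String host, long readyAt) {
            this.host = host;
            this.readyAt = readyAt;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAt - clock.getAsLong(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(readyAt, ((HostSlot) other).readyAt);
        }
    }

    private final DelayQueue<HostSlot> ready = new DelayQueue<>();
    private final Map<String, Queue<String>> pending = new HashMap<>();
    private final LongSupplier clock;   // current time in millis (assumption: injected for testing)
    private final long delayMillis;     // crawl delay, assumed identical for every host here
    private final int maxPerRun;        // the "n requests per delayed run" from the question

    HostScheduler(LongSupplier clock, long delayMillis, int maxPerRun) {
        this.clock = clock;
        this.delayMillis = delayMillis;
        this.maxPerRun = maxPerRun;
    }

    /** Queues a URL; a host seen for the first time is eligible immediately. */
    synchronized void submit(String host, String url) {
        Queue<String> urls = pending.get(host);
        if (urls == null) {
            urls = new ArrayDeque<>();
            pending.put(host, urls);
            ready.add(new HostSlot(host, clock.getAsLong()));
        }
        urls.add(url);
    }

    /** Returns up to maxPerRun URLs for one host whose delay has expired, or an empty list. */
    synchronized List<String> nextBatch() {
        HostSlot slot;
        while ((slot = ready.poll()) != null) {
            Queue<String> urls = pending.get(slot.host);
            if (urls.isEmpty()) {
                // Host has been idle for a full delay period: forget it to bound memory.
                pending.remove(slot.host);
                continue;
            }
            List<String> batch = new ArrayList<>();
            while (batch.size() < maxPerRun && !urls.isEmpty()) {
                batch.add(urls.remove());
            }
            // Re-arm the host so it is not visible again until the delay has passed.
            ready.add(new HostSlot(slot.host, clock.getAsLong() + delayMillis));
            return batch;
        }
        return new ArrayList<>();
    }
}
```

A fetcher would call nextBatch() in a loop and back off when it gets an empty list. The sketch still keeps one map entry per active host, so it does not by itself address the concern about the number of hosts.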

So my question is: is there something that Ignite can provide for making
this all work?

- Matt



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev

Re: Delay queue or similar?

Hello!

I think you could model it with an ATOMIC cache:

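while (true) {
    // Earliest time (epoch millis) at which the next request to this host is allowed.
    long time = cache.get(host);
    // replace() succeeds only if the value is still `time`, so only one caller wins the slot.
    if (time < System.currentTimeMillis() && cache.replace(host, time, time + hostDelay)) {
        // do request to host
        break;
    } else {
        // sleep or do other requests in the meantime
    }
}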
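
As a runnable illustration of the compare-and-replace idea, here is a small sketch that uses a ConcurrentHashMap as a stand-in for the ATOMIC cache (its putIfAbsent and three-argument replace are the map operations the loop relies on). The class and method names (HostGate, tryAcquire) are hypothetical. One deliberate difference, which is an assumption on my part rather than something from the thread: the next slot is set to now + hostDelay instead of time + hostDelay, so that a host starting from a default of 0 does not stay "ready" for many consecutive calls.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; the map stands in for an Ignite cache keyed by host.
final class HostGate {
    // host -> earliest time (epoch millis) at which the next request is allowed
    private final ConcurrentMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    /**
     * Tries to claim the next request slot for a host.
     * Returns true if the caller may fetch now, false if it should wait or try another host.
     */
    boolean tryAcquire(String host, long now, long hostDelay) {
        // Unknown hosts start at 0, i.e. they are immediately eligible.
        Long time = nextAllowed.putIfAbsent(host, 0L);
        if (time == null) {
            time = 0L;
        }
        if (time >= now) {
            return false; // delay has not elapsed yet
        }
        // Atomic compare-and-replace: fails if another caller claimed the slot first.
        return nextAllowed.replace(host, time, now + hostDelay);
    }
}
```

A caller would pass System.currentTimeMillis() as now and, on false, sleep or move on to another host, as in the loop above.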

Regards,
--
Ilya Kasnacheev


On Tue, 9 Oct 2018 at 16:36, matt <[hidden email]> wrote:



matt

Re: Delay queue or similar?

Thanks for the feedback, Ilya!

In your example, where would the initial "host" in "long time =
cache.get(host);" come from? In the case I need to solve, I would not
know which host is most suitable to make a request to, so I would need to
continuously loop over all available keys until the crawl is done. This may
introduce a performance hit if (for example) the only host that is ready
for a request is the last one in a very large list of keys. Does that make
sense? Apologies if I'm misunderstanding!

- Matt



ilya.kasnacheev

Re: Delay queue or similar?

Hello!

You could have a secondary (SQL) index on time, and do SELECT ... ORDER BY time to get the most eager hosts.

For the initial time, you could use 0L as the default value, i.e. check for null and use 0L if null.
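
To show what the ordered lookup buys you, here is a plain-Java stand-in for "index on time, then select in time order". It is not Ignite API: a TreeMap plays the role of the SQL index, and the names (EagerHostIndex, put, timeOf, mostEager) are hypothetical. The "time < now" filter and the limit are my assumptions about how the query would be narrowed in practice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of an index ordered by next-allowed time.
final class EagerHostIndex {
    private final Map<String, Long> timeByHost = new HashMap<>();
    // Sorted view, analogous to a secondary index on the time column.
    private final TreeMap<Long, TreeSet<String>> hostsByTime = new TreeMap<>();

    /** Inserts or updates the next-allowed time for a host, keeping the index in sync. */
    synchronized void put(String host, long time) {
        Long old = timeByHost.put(host, time);
        if (old != null) {
            TreeSet<String> hosts = hostsByTime.get(old);
            hosts.remove(host);
            if (hosts.isEmpty()) {
                hostsByTime.remove(old);
            }
        }
        hostsByTime.computeIfAbsent(time, t -> new TreeSet<>()).add(host);
    }

    /** Null-safe read: an unknown host is treated as time 0, i.e. immediately eligible. */
    synchronized long timeOf(String host) {
        Long time = timeByHost.get(host);
        return time == null ? 0L : time;
    }

    /** Roughly: hosts with time < now, ordered by time, at most `limit` of them. */
    synchronized List<String> mostEager(long now, int limit) {
        List<String> result = new ArrayList<>();
        for (TreeSet<String> hosts : hostsByTime.headMap(now).values()) {
            for (String host : hosts) {
                if (result.size() == limit) {
                    return result;
                }
                result.add(host);
            }
        }
        return result;
    }
}
```

The point of the ordering is that the fetcher only touches hosts that are already due, instead of looping over every key to find the one that is ready.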

Regards,

--
Ilya Kasnacheev


On Thu, 11 Oct 2018 at 0:20, matt <[hidden email]> wrote:



matt

Re: Delay queue or similar?

Ok will try that. Cheers!
- Matt


