Running Scrapers
Once you've written your scraper scripts, you'll need to re-run them frequently to keep the news on your site up to date.
You can do this any way you like; cron would work fine.
There is also a daemon that comes with OpenBlock which is tailor-made for this purpose, although we're deprecating it as of the 1.1 release since there's no real reason for OpenBlock to reinvent this particular wheel.
Cron Configuration
Here's an example config file for running scrapers via cron.
Important!
You must set the DJANGO_SETTINGS_MODULE environment variable and use the Python interpreter that lives in your virtualenv. The user who runs each cron job must also have permission to run those scripts and to write to any log files or other files the scrapers write to. I recommend using the same (non-root) user account you used for installing OpenBlock.
# Put this in
SHELL=/bin/bash
# Edit these as necessary
DJANGO_SETTINGS_MODULE=obdemo.settings
SCRAPERS=/path/to/ebdata/scrapers
BINDIR=/path/to/virtualenv/bin
PYTHON=/path/to/virtualenv/bin/python
USER=openblock
# Where do errors get emailed?
MAILTO=somebody@example.com
# Format:
# m h dom mon dow user command
# Retrieve flickr photos every 20 minutes.
0,20,40 * * * * $USER $PYTHON $SCRAPERS/general/flickr/flickr_retrieval.py -q
# Meetup can be slow due to hitting rate limits.
# Several times a day should be OK.
0 7,18,22 * * * $USER $PYTHON $SCRAPERS/general/meetup/meetup_retrieval.py -q
# Aggregates every 6 min.
*/6 * * * * $USER $BINDIR/update_aggregates --quiet
A more extensive example is in the obdemo source code; look for sample_crontab.
As noted in Data Scraper Tutorial, it's a very good idea if scripts have a command-line option to discard all non-error output, since cron likes to email you with all output. When using cron, silence is golden.
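As an illustration of such a quiet option, here is a minimal sketch of a scraper entry point whose -q flag raises the log level so that only errors are printed. The logger name, the function names, and the use of argparse are assumptions made for this example; they are not taken from the OpenBlock scrapers.

```python
import argparse
import logging

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("example_scraper")


def parse_args(argv=None):
    """Parse command-line options; -q/--quiet suppresses non-error output."""
    parser = argparse.ArgumentParser(description="Example scraper (illustrative only)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="only log errors")
    return parser.parse_args(argv)


def configure_logging(quiet):
    """Set the log level to ERROR when quiet, INFO otherwise, and return it."""
    level = logging.ERROR if quiet else logging.INFO
    logging.basicConfig(format="%(levelname)s %(message)s")
    logger.setLevel(level)
    return level


def main(argv=None):
    args = parse_args(argv)
    configure_logging(args.quiet)
    logger.info("Starting update")   # hidden under -q, so cron has nothing to mail
    # ... fetch and save news items here ...
    logger.info("Finished update")
    return 0
```

With this layout, a script would call main() from its `if __name__ == "__main__":` block. Run with -q, a successful run prints nothing, while anything logged at ERROR level still reaches cron and is mailed to MAILTO.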
Updaterdaemon Configuration
Deprecated!
Since cron and similar tools work just fine, we're declaring Updaterdaemon deprecated; that is, we no longer recommend using it.
The daemon script is named runner.py and it lives in ebdata, more specifically at ebdata/retrieval/updaterdaemon/runner.py. To configure it, you need to write a (small) Python script that contains a list of TASKS.
There is an example config file at ebdata/retrieval/updaterdaemon/config.py, and the one we use for obdemo is at obdemo/sample_scraper_config.py.
What goes in the config file? Let's put together a (small) example based on the one for obdemo.
First, we need a function that imports and runs one of our scrapers, just once. Let's use the one from obdemo that creates "Events". Our function can look like:
def do_events():
    from obdemo.scrapers.add_events import main
    return main()
(Note that this function could do anything we want to run periodically; updaterdaemon actually doesn't know anything about scrapers per se. One other thing you probably want to do regularly is send out openblock's E-mail alerts.)
Next, we need a way to know when, or how often, that function should run. We'll use another function for that; let's call it a "time callback". The time callback takes one argument, a Python datetime, and returns True if we should run our scraper now, and False otherwise.
Here's one that runs every ten minutes:
def every_ten_minutes(datetime):
    if datetime.minute % 10 == 0:
        return True
    return False
(Note that runner.py only wakes up and checks the time once per minute, so we don't need to be very careful here about the time check - we won't accidentally run this many times in one minute.)
(Note also that the example config file in ebdata/retrieval/updaterdaemon/config.py already contains factories to generate a number of useful time callbacks, such as multiple_daily, daily, and weekly. We could just import and call one of those. Read the source to see how they work.)
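To show what a time callback factory can look like, here is a small sketch. The helpers every_n_minutes and daily_at are hypothetical names invented for this example; they are not the factories shipped in config.py, whose signatures may differ.

```python
from datetime import datetime


def every_n_minutes(n):
    """Return a time callback that is true when the minute is a multiple of n.

    Hypothetical helper for illustration; not one of the config.py factories.
    """
    def callback(when):
        return when.minute % n == 0
    return callback


def daily_at(hour, minute=0):
    """Return a time callback that is true once a day, at hour:minute."""
    def callback(when):
        return (when.hour, when.minute) == (hour, minute)
    return callback


every_ten_minutes = every_n_minutes(10)
nightly = daily_at(3, 30)

# Each callback takes a datetime and returns True or False.
print(every_ten_minutes(datetime(2011, 1, 1, 12, 20)))  # True
print(nightly(datetime(2011, 1, 1, 12, 20)))            # False
```

Matching on an exact minute, as daily_at does, relies on the daemon checking the time once per minute as described above.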
Finally, we need to wrap all this up in a list (or tuple) called TASKS. This is what the runner.py script looks for in the config file. TASKS is a list of tuples, each in the form (time_callback, function_to_run, {keyword args}, {environ}).
We've already got the first two of those. What about the last two?
keyword args is a dictionary of extra arguments to pass to our function. Ours doesn't actually need any, so we'll use an empty dictionary, like {}.
environ is a dictionary of environment variables to set before running our function. Generally this will need to set DJANGO_SETTINGS_MODULE. For the demo, we set it to obdemo.settings by default, unless there is already an environment variable by that name. This looks like:
import os
env = {'DJANGO_SETTINGS_MODULE': os.environ.get('DJANGO_SETTINGS_MODULE', 'obdemo.settings')}
Putting it all together, we get this complete config file:
import os

def do_events():
    from obdemo.scrapers.add_events import main
    return main()

def every_ten_minutes(datetime):
    if datetime.minute % 10 == 0:
        return True
    return False

env = {'DJANGO_SETTINGS_MODULE': os.environ.get('DJANGO_SETTINGS_MODULE', 'obdemo.settings')}

TASKS = (
    (every_ten_minutes, do_events, {}, env),
)
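To make the role of each element of a TASKS tuple concrete, here is a rough sketch of how a scheduler could consume such a list on each wake-up. This is not the runner.py implementation; run_due_tasks and fake_scraper are hypothetical names, and the real daemon may handle environments, errors, and concurrency quite differently.

```python
import os
from datetime import datetime


def run_due_tasks(tasks, now):
    """Run every task whose time callback returns True for `now`.

    Returns the list of results, in task order.
    """
    results = []
    for time_callback, function, kwargs, environ in tasks:
        if not time_callback(now):
            continue
        os.environ.update(environ)          # apply the task's environment variables
        results.append(function(**kwargs))  # pass the keyword args through
    return results


# Stand-ins for a real scraper function and time callback.
def fake_scraper(label="events"):
    return "ran %s with %s" % (label, os.environ["DJANGO_SETTINGS_MODULE"])


def every_ten_minutes(when):
    return when.minute % 10 == 0


TASKS = (
    (every_ten_minutes, fake_scraper, {"label": "events"},
     {"DJANGO_SETTINGS_MODULE": "obdemo.settings"}),
)

print(run_due_tasks(TASKS, datetime(2011, 1, 1, 12, 30)))
```

Calling run_due_tasks once per minute with the current time would run fake_scraper at :00, :10, :20 and so on, and do nothing in between.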
Testing the daemon
Give it a try:
$ python ebdata/ebdata/retrieval/updaterdaemon/runner.py --config=/path/to/config.py start
If it works, nothing obvious should happen :) It's running in the background. You shouldn't expect anything to happen until the next multiple of 10 minutes. When it's time, check the log file to see if anything's happening:
$ tail -f /tmp/updaterdaemon.log
(Hit Ctrl-C to get out of that.)
If there's nothing in the main log, check the error log:
$ less /tmp/updaterdaemon.err
To stop the daemon, do this:
$ python ebdata/ebdata/retrieval/updaterdaemon/runner.py stop
Installing the init script
Updaterdaemon also comes with a script suitable for putting in /etc/init.d, so it will be restarted whenever the system is rebooted. To install this script, copy it from ebdata/retrieval/updaterdaemon/initscript into something like /etc/init.d/openblock-updaterdaemon. It is known to work on Ubuntu; let us know if you have trouble with it on other Linux systems.
After copying, edit the script, setting a few crucial environment variables:
HERE should point to the virtualenv where you installed OpenBlock.
CONFIG should point to a config file as described in the previous sections.
SU_USER should be the name of the user account to use for running the daemon.
You might also want to set LOGFILE and ERRLOGFILE to control where the logs go.
Now try running the script as root:
$ sudo /etc/init.d/openblock-updaterdaemon start
Check the log files to make sure it's working.