Overview¶
ScrapinghubClient
is a Python client for
communicating with the Scrapinghub API.
First, you instantiate a new client with your Scrapinghub API key:
>>> from scrapinghub import ScrapinghubClient
>>> apikey = '84c87545607a4bc0****************'
>>> client = ScrapinghubClient(apikey)
>>> client
<scrapinghub.client.ScrapinghubClient at 0x1047af2e8>
Working with projects¶
This client instance has a projects
attribute for accessing your projects on Scrapinghub’s platform.
With it, you can list the project IDs available in your account:
>>> client.projects.list()
[123, 456]
Note
.list()
does not return Project
instances, but their numeric IDs.
Or you can get a summary of all your projects (how many jobs are finished, running or pending to be run):
>>> client.projects.summary()
[{'finished': 674,
'has_capacity': True,
'pending': 0,
'project': 123,
'running': 1},
{'finished': 33079,
'has_capacity': True,
'pending': 0,
'project': 456,
'running': 2}]
To work with a particular project, reference it using its numeric ID:
>>> project = client.get_project(123)
>>> project
<scrapinghub.client.Project at 0x106cdd6a0>
>>> project.key
'123'
Note
get_project()
returns a Project
instance.
Tip
The above is a shortcut for client.projects.get(123)
.
Working with spiders¶
A Scrapinghub project (usually) consists of a group of web crawlers called “spiders”.
The different spiders within your project are accessible via the
spiders
attribute of the
Project
instance.
To get the list of spiders in the project, use .spiders.list()
:
>>> project.spiders.list()
[
{'id': 'spider1', 'tags': [], 'type': 'manual', 'version': '123'},
{'id': 'spider2', 'tags': [], 'type': 'manual', 'version': '123'}
]
To select a particular spider to work with, use .spiders.get(<spidername>)
:
>>> spider = project.spiders.get('spider2')
>>> spider
<scrapinghub.client.Spider at 0x106ee3748>
>>> spider.key
'123/2'
>>> spider.name
spider2
With .spiders.get(<spidername>)
, you get a Spider
instance back.
Working with jobs collections¶
Essentially, the purpose of spiders is to be run in Scrapinghub’s platform.
Each spider run is called a “job”.
And a collection of spider jobs is represented by a Jobs
object.
Both project-level jobs (i.e. all jobs from a project) and spider-level jobs
(i.e. all jobs for a specific spider) are available as a jobs
attribute of a
Project
instance
or a Spider
instance respectively.
Running jobs¶
Use the .jobs.run()
method to run a new job for a project or a particular spider,:
>>> job = spider.jobs.run()
You can also use .jobs.run()
at the project level, the difference being that
a spider name is required:
>>> job = project.jobs.run('spider1')
Scheduling jobs supports different options, passed as arguments to .run()
:
- job_args (dict): to provide arguments for the job
- job_settings (dict): to pass additional settings for the job
- units (integer): to specify amount of units to run the job
- priority (integer): to set higher/lower priority for the job
- add_tag (list of strings): to create a job with a set of initial tags
- meta (dict): to pass additional custom metadata
Check the run endpoint for more information.
For example, to run a new job for a given spider with custom parameters:
>>> job = spider.jobs.run(units=2, job_settings={'SETTING': 'VALUE'}, priority=1,
... add_tag=['tagA','tagB'], meta={'custom-data': 'val1'})
Getting job information¶
To select a specific job for a project, use .jobs.get(<jobKey>)
:
>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'
Also there’s a shortcut to get same job with client instance:
>>> job = client.get_job('123/1/2')
Counting jobs¶
It’s also possible to count jobs for a given project or spider via
.jobs.count()
:
>>> spider.jobs.count()
5
The counting logic supports different filters, as described for count endpoint.
Iterating over jobs¶
To loop over the spider jobs (most recently finished first),
you can use .jobs.iter()
to get an iterator object:
>>> jobs_summary = spider.jobs.iter()
>>> [j['key'] for j in jobs_summary]
['123/1/3', '123/1/2', '123/1/1']
The .jobs.iter()
iterator generates dicts
(not Job
objects), e.g:
{u'close_reason': u'finished',
u'elapsed': 201815620,
u'finished_time': 1492843577852,
u'items': 2,
u'key': u'123320/3/155',
u'logs': 21,
u'pages': 2,
u'pending_time': 1492843520319,
u'running_time': 1492843526622,
u'spider': u'spider001',
u'state': u'finished',
u'ts': 1492843563720,
u'version': u'792458b-master'}
You typically use it like this:
>>> for job in jobs_summary:
... # do something with job data
Or, if you just want to get the job IDs:
>>> [x['key'] for x in jobs_summary]
['123/1/3', '123/1/2', '123/1/1']
The job’s dict fieldset from .jobs.iter()
is less detailed than job.metadata
(see below),
but can contain a few additional fields as well, on demand.
Additional fields can be requested using the jobmeta
argument.
When jobmeta
is used, the user MUST list all required fields,
even default ones:
>>> # by default, the "spider" key is available in the dict from iter()
>>> job_summary = next(project.jobs.iter())
>>> job_summary.get('spider', 'missing')
'foo'
>>>
>>> # when jobmeta is use, if "spider" key is not listed in it,
>>> # iter() will not include "spider" key in the returned dicts
>>> jobs_summary = project.jobs.iter(jobmeta=['scheduled_by'])
>>> job_summary = next(jobs_summary)
>>> job_summary.get('scheduled_by', 'missing')
'John'
>>> job_summary.get('spider', 'missing')
missing
By default .jobs.iter()
returns the last 1000 jobs at most.
To get more than the last 1000, you need to paginate through results
in batches, using the start
parameter:
>>> jobs_summary = spider.jobs.iter(start=1000)
There are several filters like spider
, state
, has_tag
,
lacks_tag
, startts
and endts
(check list endpoint for more details).
To get jobs filtered by tags:
>>> jobs_summary = project.jobs.iter(has_tag=['new', 'verified'], lacks_tag='obsolete')
Warning
The list of tags in has_tag
is an OR condition, so in the case above,
jobs with either 'new'
or 'verified'
tag are selected.
On the contrary the list of tags in lacks_tag
is a logical AND.
To get a specific number of last finished jobs of some spider,
use spider
, state
and count
arguments:
>>> jobs_summary = project.jobs.iter(spider='foo', state='finished', count=3)
There are 4 possible job states, which can be used as (string) values for filtering by state:
'pending'
: the job is scheduled to run when enough units become available;'running'
: the job is running;'finished'
: the job has ended;'deleted'
: the jobs has been deleted and will become unavailable when the platform performs its next cleanup.
Dictionary entries returned by .jobs.iter()
method contain some additional meta,
but can be easily converted to Job
instances with:
>>> [Job(client, x['key']) for x in jobs]
[
<scrapinghub.client.Job at 0x106e2cc18>,
<scrapinghub.client.Job at 0x106e260b8>,
<scrapinghub.client.Job at 0x106e26a20>,
]
Jobs summaries¶
To check jobs summary:
>>> spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
{'count': 0, 'name': 'running', 'summary': []},
{'count': 5,
'name': 'finished',
'summary': [...]}
It’s also possible to get last jobs summary (for each spider):
>>> list(sp.jobs.iter_last())
[{'close_reason': 'success',
'elapsed': 3062444,
'errors': 1,
'finished_time': 1482911633089,
'key': '123/1/3',
'logs': 8,
'pending_time': 1482911596566,
'running_time': 1482911598909,
'spider': 'spider1',
'state': 'finished',
'ts': 1482911615830,
'version': 'some-version'}]
Note that there can be a lot of spiders, so the method above returns an iterator.
Updating tags¶
Tags is a convenient way to mark specific jobs (for better search, postprocessing etc).
To mark all spider jobs with tag consumed
:
>>> spider.jobs.update_tags(add=['consumed'])
To remove existing tag existing
for all spider jobs:
>>> spider.jobs.update_tags(remove=['existing'])
Canceling jobs¶
To cancel a few jobs by keys at once:
>>> spider.jobs.cancel(['123/1/2', '123/1/3'])
All jobs should belong to the same project.
Note that there’s a limit on amount of job keys you can cancel with a single call, please contact support if the amount is more than 1k.
Job actions¶
You can perform actions on a Job
instance.
For example, to cancel a running or pending job, simply call cancel()
on it:
>>> job.cancel()
To delete a job, its metadata, logs and items, call delete()
:
>>> job.delete()
To mark a job with the tag 'consumed'
, call update_tags()
:
>>> job.update_tags(add=['consumed'])
Job data¶
A Job
instance provides access to its
associated data, using the following attributes:
metadata
: various information on the job itself;items
: the data items that the job produced;logs
: log entries that the job produced;requests
: HTTP requests that the job issued;samples
: runtime stats that the job uploaded;
Metadata¶
Metadata about a job details can be accessed via its metadata
attribute.
The corresponding object
acts like a Python dictionary:
>>> job.metadata.get('version')
'5123a86-master'
To check what keys are available (they ultimately depend on the job),
you can use its .iter()
method (here, it’s wrapped inside a dict for readability):
>>> dict(job.metadata.iter())
{...
u'close_reason': u'finished',
u'completed_by': u'jobrunner',
u'deploy_id': 16,
u'finished_time': 1493007370566,
u'job_settings': {u'CLOSESPIDER_PAGECOUNT': 5,
u'SOME_CUSTOM_SETTING': 10},
u'pending_time': 1493006433100,
u'priority': 2,
u'project': 123456,
u'running_time': 1493006488829,
u'scheduled_by': u'periodicjobs',
u'scrapystats': {u'downloader/request_bytes': 96774,
u'downloader/request_count': 228,
u'downloader/request_method_count/GET': 228,
u'downloader/response_bytes': 923251,
u'downloader/response_count': 228,
u'downloader/response_status_count/200': 228,
u'finish_reason': u'finished',
u'finish_time': 1493007337660.0,
u'httpcache/firsthand': 228,
u'httpcache/miss': 228,
u'httpcache/store': 228,
u'item_scraped_count': 684,
u'log_count/INFO': 22,
u'memusage/max': 63311872,
u'memusage/startup': 60248064,
u'request_depth_max': 50,
u'response_received_count': 228,
u'scheduler/dequeued': 228,
u'scheduler/dequeued/disk': 228,
u'scheduler/enqueued': 228,
u'scheduler/enqueued/disk': 228,
u'start_time': 1493006508701.0},
u'spider': u'myspider',
u'spider_args': {u'arg1': u'value1',
u'arg2': u'value2'},
u'spider_type': u'manual',
u'started_by': u'jobrunner',
u'state': u'finished',
u'tags': [],
u'units': 1,
u'version': u'792458b-master'}
As you may have noticed in the example above, if the job was a Scrapy
spider run, the metadata object contains a special 'scrapystats'
key,
which is a dict representation of the crawl’s Scrapy stats
values:
>>> job.metadata.get('scrapystats')
...
'downloader/response_count': 104,
'downloader/response_status_count/200': 104,
'finish_reason': 'finished',
'finish_time': 1447160494937,
'item_scraped_count': 50,
'log_count/DEBUG': 157,
'log_count/INFO': 1365,
'log_count/WARNING': 3,
'memusage/max': 182988800,
'memusage/startup': 62439424,
...
Anything can be stored in a job’s metadata, here is example how to add tags:
>>> job.metadata.set('tags', ['obsolete'])
Items¶
To retrieve all scraped items (as Python dicts) from a job, use
job.items.iter()
:
>>> for item in job.items.iter():
... # do something with item (it's just a dict)
Logs¶
To retrieve all log entries from a job use job.logs.iter()
:
>>> for logitem in job.logs.iter():
... # logitem is a dict with level, message, time
>>> logitem
{
'level': 20,
'message': '[scrapy.core.engine] Closing spider (finished)',
'time': 1482233733976},
}
Requests¶
To retrieve all requests from a job, there’s job.requests.iter()
:
>>> for reqitem in job.requests.iter():
... # reqitem is a dict
>>> reqitem
[{
'duration': 354,
'fp': '6d748741a927b10454c83ac285b002cd239964ea',
'method': 'GET',
'rs': 1270,
'status': 200,
'time': 1482233733870,
'url': 'https://example.com'
}]
Project activity log¶
Project.activity
provides a
convenient interface to project activity events.
To retrieve activity events from a project, you can use .activity.iter()
,
with optional arguments (here, the last 3 events, with timestamp information):
>>> list(project.activity.iter(count=3, meta="_ts"))
[{u'_ts': 1493362000130,
u'event': u'job:completed',
u'job': u'123456/3/161',
u'user': u'jobrunner'},
{u'_ts': 1493361946077,
u'event': u'job:started',
u'job': u'123456/3/161',
u'user': u'jobrunner'},
{u'_ts': 1493361942440,
u'event': u'job:scheduled',
u'job': u'123456/3/161',
u'user': u'periodicjobs'}]
To retrieve all the events, use .activity.list()
>>> project.activity.list()
[{'event': 'job:completed', 'job': '123/2/3', 'user': 'jobrunner'},
{'event': 'job:cancelled', 'job': '123/2/3', 'user': 'john'}]
To post a new activity event, use .activity.add()
:
>>> event = {'event': 'job:completed', 'job': '123/2/4', 'user': 'john'}
>>> project.activity.add(event)
Or post multiple events at once:
>>> events = [
... {'event': 'job:completed', 'job': '123/2/5', 'user': 'john'},
... {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'},
... ]
>>> project.activity.add(events)
Collections¶
Scrapinghub’s Collections provide a way to store an arbitrary number of records indexed by a key. They’re often used by Scrapinghub projects as a single place to write information from multiple scraping jobs.
Read more about Collections in the official docs.
As an example, let’s store a hash and timestamp pair for spider ‘foo’.
The usual workflow with project.collections
would be:
- reference your project’s
collections
attribute, - call
.get_store(<somename>)
to create or access the named collection you want (the collection will be created automatically if it doesn’t exist) ; you get a “store” object back, - call
.set(<key/value> pairs)
to store data.
>>> collections = project.collections
>>> foo_store = collections.get_store('foo_store')
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{u'value': u'1447221694537'}
>>> # iterate over _key & value pair
... list(foo_store.iter())
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> # filter by multiple keys - only values for keys that exist will be returned
... list(foo_store.iter(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']))
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.count()
0
Collections are available at project level only.
Frontiers¶
Typical workflow with Frontiers
:
>>> frontiers = project.frontiers
Get all frontiers from a project to iterate through it:
>>> frontiers.iter()
<list_iterator at 0x103c93630>
List all frontiers:
>>> frontiers.list()
['test', 'test1', 'test2']
Get a Frontier
instance by name:
>>> frontier = frontiers.get('test')
>>> frontier
<scrapinghub.client.Frontier at 0x1048ae4a8>
Get an iterator to iterate through a frontier slots:
>>> frontier.iter()
<list_iterator at 0x1030736d8>
List all slots:
>>> frontier.list()
['example.com', 'example.com2']
Get a FrontierSlot
by name:
>>> slot = frontier.get('example.com')
>>> slot
<scrapinghub.client.FrontierSlot at 0x1049d8978>
Add a request to the slot:
>>> slot.queue.add([{'fp': '/some/path.html'}])
>>> slot.flush()
>>> slot.newcount
1
newcount
is defined per slot, but also available per frontier and globally:
>>> frontier.newcount
1
>>> frontiers.newcount
3
Add a fingerprint only to the slot:
>>> slot.fingerprints.add(['fp1', 'fp2'])
>>> slot.flush()
There are convenient shortcuts: f
for fingerprints
to access
FrontierSlotFingerprints
and q
for
queue
to access FrontierSlotQueue
.
Add requests with additional parameters:
>>> slot.q.add([{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}])
>>> slot.flush()
To retrieve all requests for a given slot:
>>> reqs = slot.q.iter()
To retrieve all fingerprints for a given slot:
>>> fps = slot.f.iter()
To list all the requests use list()
method (similar for fingerprints
):
>>> fps = slot.q.list()
To delete a batch of requests:
>>> slot.q.delete('00013967d8af7b0001')
To delete the whole slot from the frontier:
>>> slot.delete()
Flush data of the given frontier:
>>> frontier.flush()
Flush data of all frontiers of a project:
>>> frontiers.flush()
Close batch writers of all frontiers of a project:
>>> frontiers.close()
Frontiers are available on project level only.
Settings¶
You can work with project settings via Settings
.
To get a list of the project settings:
>>> project.settings.list()
[(u'default_job_units', 2), (u'job_runtime_limit', 24)]]
To get a project setting value by name:
>>> project.settings.get('job_runtime_limit')
24
To update a project setting value by name:
>>> project.settings.set('job_runtime_limit', 20)
Or update a few project settings at once:
>>> project.settings.update({'default_job_units': 1,
... 'job_runtime_limit': 20})
Exceptions¶
-
exception
scrapinghub.
ScrapinghubAPIError
(message=None, http_error=None)¶ Base exception class.
-
exception
scrapinghub.
BadRequest
(message=None, http_error=None)¶ Usually raised in case of 400 response from API.
Request lacks valid authentication credentials for the target resource.
-
exception
scrapinghub.
NotFound
(message=None, http_error=None)¶ Entity doesn’t exist (e.g. spider or project).
-
exception
scrapinghub.
ValueTooLarge
(message=None, http_error=None)¶ Value cannot be writtent because it exceeds size limits.
-
exception
scrapinghub.
DuplicateJobError
(message=None, http_error=None)¶ Job for given spider with given arguments is already scheduled or running.
-
exception
scrapinghub.
ServerError
(message=None, http_error=None)¶ Indicates some server error: something unexpected has happened.