Overview

ScrapinghubClient is a Python client for communicating with the Scrapinghub API.

First, you instantiate a new client with your Scrapinghub API key:

>>> from scrapinghub import ScrapinghubClient
>>> apikey = '84c87545607a4bc0****************'
>>> client = ScrapinghubClient(apikey)
>>> client
<scrapinghub.client.ScrapinghubClient at 0x1047af2e8>

Working with projects

This client instance has a projects attribute for accessing your projects on Scrapinghub’s platform.

With it, you can list the project IDs available in your account:

>>> client.projects.list()
[123, 456]

Note

.list() does not return Project instances, but their numeric IDs.

Or you can get a summary of all your projects (how many jobs are finished, running or pending to be run):

>>> client.projects.summary()
[{'finished': 674,
  'has_capacity': True,
  'pending': 0,
  'project': 123,
  'running': 1},
 {'finished': 33079,
  'has_capacity': True,
  'pending': 0,
  'project': 456,
  'running': 2}]
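The summary dicts lend themselves to simple aggregation. As a minimal sketch (the helper name `total_jobs_by_state` is made up for this example and is not part of the client), the per-project counters can be summed across the whole account:

```python
def total_jobs_by_state(summaries):
    """Sum the 'finished', 'pending' and 'running' counters over all projects."""
    totals = {'finished': 0, 'pending': 0, 'running': 0}
    for summary in summaries:
        for state in totals:
            # Treat a missing counter as zero rather than failing.
            totals[state] += summary.get(state, 0)
    return totals


# Sample data copied from the client.projects.summary() output above.
sample = [
    {'finished': 674, 'has_capacity': True, 'pending': 0, 'project': 123, 'running': 1},
    {'finished': 33079, 'has_capacity': True, 'pending': 0, 'project': 456, 'running': 2},
]
totals = total_jobs_by_state(sample)
print(totals)
```

With a live client, the same helper could be fed `client.projects.summary()` directly.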

To work with a particular project, reference it using its numeric ID:

>>> project = client.get_project(123)
>>> project
<scrapinghub.client.Project at 0x106cdd6a0>
>>> project.key
'123'

Note

get_project() returns a Project instance.

Tip

The above is a shortcut for client.projects.get(123).

Working with spiders

A Scrapinghub project (usually) consists of a group of web crawlers called “spiders”.

The different spiders within your project are accessible via the spiders attribute of the Project instance.

To get the list of spiders in the project, use .spiders.list():

>>> project.spiders.list()
[
  {'id': 'spider1', 'tags': [], 'type': 'manual', 'version': '123'},
  {'id': 'spider2', 'tags': [], 'type': 'manual', 'version': '123'}
]

To select a particular spider to work with, use .spiders.get(<spidername>):

>>> spider = project.spiders.get('spider2')
>>> spider
<scrapinghub.client.Spider at 0x106ee3748>
>>> spider.key
'123/2'
>>> spider.name
'spider2'

With .spiders.get(<spidername>), you get a Spider instance back.

Note

.spiders.list() does not return Spider instances. The id key in the returned dicts corresponds to the .name attribute of Spider that you get with .spiders.get(<spidername>).

Working with jobs collections

Spiders exist to be run on Scrapinghub’s platform. Each spider run is called a “job”, and a collection of spider jobs is represented by a Jobs object.

Both project-level jobs (i.e. all jobs from a project) and spider-level jobs (i.e. all jobs for a specific spider) are available as a jobs attribute of a Project instance or a Spider instance respectively.

Running jobs

Use the .jobs.run() method to run a new job for a particular spider:

>>> job = spider.jobs.run()

You can also use .jobs.run() at the project level, the difference being that a spider name is required:

>>> job = project.jobs.run('spider1')

Scheduling jobs supports different options, passed as arguments to .run():

job_args (dict): to provide arguments for the job
job_settings (dict): to pass additional settings for the job
units (integer): to specify the number of units to run the job with
priority (integer): to set higher/lower priority for the job
add_tag (list of strings): to create a job with a set of initial tags
meta (dict): to pass additional custom metadata

Check the run endpoint for more information.

For example, to run a new job for a given spider with custom parameters:

>>> job = spider.jobs.run(units=2, job_settings={'SETTING': 'VALUE'}, priority=1,
...                       add_tag=['tagA','tagB'], meta={'custom-data': 'val1'})

Getting job information

To select a specific job for a project, use .jobs.get(<jobKey>):

>>> job = project.jobs.get('123/1/2')
>>> job.key
'123/1/2'

There’s also a shortcut to get the same job from the client instance:

>>> job = client.get_job('123/1/2')

These methods return a Job instance (see below).
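The keys shown so far ('123' for a project, '123/2' for a spider, '123/1/2' for a job) appear to follow a project/spider/job layout. Assuming that layout holds, a small hypothetical helper (`parse_job_key` and `JobKey` are names invented here, not part of the client) can split a job key into its numeric parts:

```python
from collections import namedtuple

# Hypothetical container for the three parts of a job key.
JobKey = namedtuple('JobKey', ['project_id', 'spider_id', 'job_id'])


def parse_job_key(key):
    """Split a job key such as '123/1/2' into three integers."""
    parts = key.split('/')
    if len(parts) != 3:
        raise ValueError('expected <project>/<spider>/<job>, got %r' % key)
    return JobKey(*(int(part) for part in parts))


parsed = parse_job_key('123/1/2')
print(parsed.project_id, parsed.spider_id, parsed.job_id)
```

This can be handy for validating a key before passing it to client.get_job().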

Counting jobs

It’s also possible to count jobs for a given project or spider via .jobs.count():

>>> spider.jobs.count()
5

The counting logic supports different filters, as described for the count endpoint.

Iterating over jobs

To loop over the spider jobs (most recently finished first), you can use .jobs.iter() to get an iterator object:

>>> jobs_summary = spider.jobs.iter()
>>> [j['key'] for j in jobs_summary]
['123/1/3', '123/1/2', '123/1/1']

The .jobs.iter() iterator generates dicts (not Job objects), e.g.:

{u'close_reason': u'finished',
 u'elapsed': 201815620,
 u'finished_time': 1492843577852,
 u'items': 2,
 u'key': u'123320/3/155',
 u'logs': 21,
 u'pages': 2,
 u'pending_time': 1492843520319,
 u'running_time': 1492843526622,
 u'spider': u'spider001',
 u'state': u'finished',
 u'ts': 1492843563720,
 u'version': u'792458b-master'}
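The `*_time` fields look like Unix timestamps in milliseconds. That is an assumption based on their magnitude; the unit is not stated here. Under that assumption, a rough sketch for deriving how long a job ran, using two hypothetical helpers:

```python
from datetime import datetime, timezone


def job_runtime_seconds(summary):
    """Seconds between 'running_time' and 'finished_time', or None if either is missing."""
    started = summary.get('running_time')
    finished = summary.get('finished_time')
    if started is None or finished is None:
        return None
    # Assumption: both values are milliseconds since the Unix epoch.
    return (finished - started) / 1000.0


def to_datetime(ms):
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)


# Values copied from the sample dict above.
sample = {'running_time': 1492843526622, 'finished_time': 1492843577852}
runtime = job_runtime_seconds(sample)
finished_at = to_datetime(sample['finished_time'])
print(runtime, finished_at.date())
```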

You typically use it like this:

>>> for job in jobs_summary:
...     pass  # do something with job data

Or, if you just want to get the job IDs:

>>> [x['key'] for x in jobs_summary]
['123/1/3', '123/1/2', '123/1/1']

The dicts from .jobs.iter() are less detailed than job.metadata (see below), but additional fields can be requested on demand using the jobmeta argument.

When jobmeta is used, the user MUST list all required fields, even default ones:

>>> # by default, the "spider" key is available in the dict from iter()
>>> job_summary = next(project.jobs.iter())
>>> job_summary.get('spider', 'missing')
'foo'
>>>
>>> # when jobmeta is used, if "spider" key is not listed in it,
>>> # iter() will not include "spider" key in the returned dicts
>>> jobs_summary = project.jobs.iter(jobmeta=['scheduled_by'])
>>> job_summary = next(jobs_summary)
>>> job_summary.get('scheduled_by', 'missing')
'John'
>>> job_summary.get('spider', 'missing')
'missing'

By default .jobs.iter() returns the last 1000 jobs at most. To get more than the last 1000, you need to paginate through results in batches, using the start parameter:

>>> jobs_summary = spider.jobs.iter(start=1000)
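Pagination can be wrapped in a generator. The sketch below assumes that `start` is a zero-based offset and that `count` caps the size of each batch; both arguments appear in this document, but their exact semantics should be checked against the list endpoint. `iter_all_jobs` and `FakeJobs` are hypothetical names; `FakeJobs` only stands in for spider.jobs so the example runs offline.

```python
def iter_all_jobs(jobs, batch_size=1000):
    """Yield every job summary by requesting successive batches."""
    start = 0
    while True:
        batch = list(jobs.iter(start=start, count=batch_size))
        for summary in batch:
            yield summary
        if len(batch) < batch_size:
            # A short (or empty) batch means there is nothing left to fetch.
            break
        start += batch_size


class FakeJobs(object):
    """Hypothetical stand-in for spider.jobs, newest job first."""

    def __init__(self, total):
        self._keys = ['123/1/%d' % n for n in range(total, 0, -1)]

    def iter(self, start=0, count=1000):
        return iter([{'key': key} for key in self._keys[start:start + count]])


all_keys = [job['key'] for job in iter_all_jobs(FakeJobs(2500))]
print(len(all_keys), all_keys[0], all_keys[-1])
```

With a real client, `iter_all_jobs(spider.jobs)` would be the intended call.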

There are several filters, such as spider, state, has_tag, lacks_tag, startts and endts (check the list endpoint for more details).

To get jobs filtered by tags:

>>> jobs_summary = project.jobs.iter(has_tag=['new', 'verified'], lacks_tag='obsolete')

Warning

The list of tags in has_tag is an OR condition, so in the case above, jobs with either 'new' or 'verified' tag are selected.

By contrast, the list of tags in lacks_tag is combined with a logical AND.
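To make the OR/AND distinction concrete, here is a local predicate that mirrors one reading of the filter semantics: a job passes if it has at least one of the has_tag tags and none of the lacks_tag tags. This is an interpretation of the warning above, not the platform's implementation, and `matches_tag_filters` is a hypothetical name.

```python
def matches_tag_filters(job_tags, has_tag=None, lacks_tag=None):
    """Return True if a job's tags would pass the has_tag/lacks_tag filters."""
    def as_list(value):
        # Accept None, a single tag string, or a list of tags.
        if value is None:
            return []
        if isinstance(value, str):
            return [value]
        return list(value)

    tags = set(job_tags)
    wanted = as_list(has_tag)
    unwanted = as_list(lacks_tag)
    # has_tag is an OR: at least one of the listed tags must be present.
    if wanted and not any(tag in tags for tag in wanted):
        return False
    # lacks_tag is an AND: every listed tag must be absent.
    return all(tag not in tags for tag in unwanted)


print(matches_tag_filters(['new'], has_tag=['new', 'verified'], lacks_tag='obsolete'))
print(matches_tag_filters(['verified', 'obsolete'], has_tag=['new', 'verified'], lacks_tag='obsolete'))
```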

To get a given number of the most recently finished jobs for a spider, use the spider, state and count arguments:

>>> jobs_summary = project.jobs.iter(spider='foo', state='finished', count=3)

There are 4 possible job states, which can be used as (string) values for filtering by state:

'pending': the job is scheduled to run when enough units become available;
'running': the job is running;
'finished': the job has ended;
'deleted': the job has been deleted and will become unavailable when the platform performs its next cleanup.

The dicts returned by .jobs.iter() contain some job metadata, and can easily be converted to Job instances:

>>> [Job(client, x['key']) for x in jobs_summary]
[
  <scrapinghub.client.Job at 0x106e2cc18>,
  <scrapinghub.client.Job at 0x106e260b8>,
  <scrapinghub.client.Job at 0x106e26a20>,
]

Jobs summaries

To get a summary of jobs, grouped by state:

>>> spider.jobs.summary()
[{'count': 0, 'name': 'pending', 'summary': []},
 {'count': 0, 'name': 'running', 'summary': []},
 {'count': 5,
  'name': 'finished',
  'summary': [...]}]
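The summary is a list with one entry per state, which can be awkward to query. A minimal sketch that flattens it into a name-to-count mapping (`counts_by_state` is a hypothetical helper, and the elided 'summary' lists are replaced by empty lists in the sample):

```python
def counts_by_state(summary):
    """Map each state name to its job count."""
    return {entry['name']: entry['count'] for entry in summary}


# Shape copied from the spider.jobs.summary() output above;
# the elided 'summary' lists are left empty here.
sample = [
    {'count': 0, 'name': 'pending', 'summary': []},
    {'count': 0, 'name': 'running', 'summary': []},
    {'count': 5, 'name': 'finished', 'summary': []},
]
counts = counts_by_state(sample)
print(counts)
```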

It’s also possible to get a summary of the last job for each spider:

>>> list(sp.jobs.iter_last())
[{'close_reason': 'success',
  'elapsed': 3062444,
  'errors': 1,
  'finished_time': 1482911633089,
  'key': '123/1/3',
  'logs': 8,
  'pending_time': 1482911596566,
  'running_time': 1482911598909,
  'spider': 'spider1',
  'state': 'finished',
  'ts': 1482911615830,
  'version': 'some-version'}]

Note that there can be a lot of spiders, so the method above returns an iterator.

Updating tags

Tags are a convenient way to mark specific jobs (for better search, post-processing, etc.).

To mark all of a spider’s jobs with the tag 'consumed':

>>> spider.jobs.update_tags(add=['consumed'])

To remove the tag 'existing' from all of a spider’s jobs:

>>> spider.jobs.update_tags(remove=['existing'])

Modifying tags is available at Spider level and Job level.

Canceling jobs

To cancel a few jobs by keys at once:

>>> spider.jobs.cancel(['123/1/2', '123/1/3'])

All jobs should belong to the same project.

Note that there’s a limit on the number of job keys you can cancel with a single call; please contact support if you need to cancel more than 1k.
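Because every call must stay within one project, it can help to group keys before cancelling. The sketch below groups job keys by their project prefix and splits each group into batches. Whether issuing several smaller calls is an acceptable way to stay under the per-call limit is an assumption that is not confirmed here; `batch_job_keys` is a hypothetical helper.

```python
from collections import defaultdict


def batch_job_keys(job_keys, batch_size=1000):
    """Group job keys by project, then yield (project_id, keys) batches."""
    by_project = defaultdict(list)
    for key in job_keys:
        # The project ID is the part of the key before the first slash.
        by_project[key.split('/', 1)[0]].append(key)
    for project_id, keys in by_project.items():
        for offset in range(0, len(keys), batch_size):
            yield project_id, keys[offset:offset + batch_size]


batches = list(batch_job_keys(['123/1/2', '456/1/1', '123/1/3'], batch_size=2))
print(batches)
```

Each batch could then be passed to the .jobs.cancel() call of the matching project.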

Job actions

You can perform actions on a Job instance.

For example, to cancel a running or pending job, simply call cancel() on it:

>>> job.cancel()

To delete a job, its metadata, logs and items, call delete():

>>> job.delete()

To mark a job with the tag 'consumed', call update_tags():

>>> job.update_tags(add=['consumed'])

Job data

A Job instance provides access to its associated data, using the following attributes:

metadata: various information on the job itself;
items: the data items that the job produced;
logs: log entries that the job produced;
requests: HTTP requests that the job issued;
samples: runtime stats that the job uploaded.

Metadata

Details about a job can be accessed via its metadata attribute. The corresponding object acts like a Python dictionary:

>>> job.metadata.get('version')
'5123a86-master'

To check what keys are available (they ultimately depend on the job), you can use its .iter() method (here, it’s wrapped inside a dict for readability):

>>> dict(job.metadata.iter())
{...
 u'close_reason': u'finished',
 u'completed_by': u'jobrunner',
 u'deploy_id': 16,
 u'finished_time': 1493007370566,
 u'job_settings': {u'CLOSESPIDER_PAGECOUNT': 5,
                   u'SOME_CUSTOM_SETTING': 10},
 u'pending_time': 1493006433100,
 u'priority': 2,
 u'project': 123456,
 u'running_time': 1493006488829,
 u'scheduled_by': u'periodicjobs',
 u'scrapystats': {u'downloader/request_bytes': 96774,
                  u'downloader/request_count': 228,
                  u'downloader/request_method_count/GET': 228,
                  u'downloader/response_bytes': 923251,
                  u'downloader/response_count': 228,
                  u'downloader/response_status_count/200': 228,
                  u'finish_reason': u'finished',
                  u'finish_time': 1493007337660.0,
                  u'httpcache/firsthand': 228,
                  u'httpcache/miss': 228,
                  u'httpcache/store': 228,
                  u'item_scraped_count': 684,
                  u'log_count/INFO': 22,
                  u'memusage/max': 63311872,
                  u'memusage/startup': 60248064,
                  u'request_depth_max': 50,
                  u'response_received_count': 228,
                  u'scheduler/dequeued': 228,
                  u'scheduler/dequeued/disk': 228,
                  u'scheduler/enqueued': 228,
                  u'scheduler/enqueued/disk': 228,
                  u'start_time': 1493006508701.0},
 u'spider': u'myspider',
 u'spider_args': {u'arg1': u'value1',
                  u'arg2': u'value2'},
 u'spider_type': u'manual',
 u'started_by': u'jobrunner',
 u'state': u'finished',
 u'tags': [],
 u'units': 1,
 u'version': u'792458b-master'}

As you may have noticed in the example above, if the job was a Scrapy spider run, the metadata object contains a special 'scrapystats' key, which is a dict representation of the crawl’s Scrapy stats values:

>>> job.metadata.get('scrapystats')
...
'downloader/response_count': 104,
'downloader/response_status_count/200': 104,
'finish_reason': 'finished',
'finish_time': 1447160494937,
'item_scraped_count': 50,
'log_count/DEBUG': 157,
'log_count/INFO': 1365,
'log_count/WARNING': 3,
'memusage/max': 182988800,
'memusage/startup': 62439424,
...
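Since 'scrapystats' is a plain dict, derived figures are easy to compute. As an illustrative sketch (the helper `items_per_response` is hypothetical, and the ratio is just one example of a derived metric), items scraped per downloaded response:

```python
def items_per_response(stats):
    """Average number of scraped items per downloader response, or None."""
    responses = stats.get('downloader/response_count', 0)
    if not responses:
        # Avoid dividing by zero when nothing was downloaded.
        return None
    return stats.get('item_scraped_count', 0) / responses


# Two values taken from the full metadata example earlier in this section.
sample_stats = {'downloader/response_count': 228, 'item_scraped_count': 684}
ratio = items_per_response(sample_stats)
print(ratio)
```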

Anything can be stored in a job’s metadata. Here is an example of how to set tags:

>>> job.metadata.set('tags', ['obsolete'])

Items

To retrieve all scraped items (as Python dicts) from a job, use job.items.iter():

>>> for item in job.items.iter():
...     pass  # do something with item (it's just a dict)

Logs

To retrieve all log entries from a job use job.logs.iter():

>>> for logitem in job.logs.iter():
...     pass  # logitem is a dict with level, message, time
>>> logitem
{
  'level': 20,
  'message': '[scrapy.core.engine] Closing spider (finished)',
  'time': 1482233733976
}
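A log entry can be turned into a readable line with the standard library. The sketch below assumes that 'level' uses Python's numeric logging levels (20 being INFO) and that 'time' is a millisecond Unix timestamp; neither is stated explicitly here. `format_log_entry` is a hypothetical helper.

```python
import logging
from datetime import datetime, timezone


def format_log_entry(entry):
    """Render a log dict as '<UTC time> <LEVEL> <message>'."""
    # Assumption: 'time' is milliseconds since the Unix epoch.
    when = datetime.fromtimestamp(entry['time'] / 1000.0, tz=timezone.utc)
    # Assumption: 'level' follows the numeric levels of the logging module.
    level = logging.getLevelName(entry['level'])
    return '%s %s %s' % (when.strftime('%Y-%m-%d %H:%M:%S'), level, entry['message'])


# Sample entry copied from the output above.
sample = {
    'level': 20,
    'message': '[scrapy.core.engine] Closing spider (finished)',
    'time': 1482233733976,
}
line = format_log_entry(sample)
print(line)
```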

Requests

To retrieve all requests from a job, there’s job.requests.iter():

>>> for reqitem in job.requests.iter():
...     pass  # reqitem is a dict
>>> reqitem
{
  'duration': 354,
  'fp': '6d748741a927b10454c83ac285b002cd239964ea',
  'method': 'GET',
  'rs': 1270,
  'status': 200,
  'time': 1482233733870,
  'url': 'https://example.com'
}
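Request dicts are convenient for quick health checks on a crawl. As a small sketch, the responses can be tallied by HTTP status (`status_counts` is a hypothetical helper; only the first sample entry comes from the output above, the other two are invented for illustration):

```python
from collections import Counter


def status_counts(requests):
    """Count request dicts by their 'status' field."""
    return Counter(req['status'] for req in requests)


sample = [
    # First entry mirrors the documented output (fields trimmed).
    {'status': 200, 'duration': 354, 'url': 'https://example.com'},
    # Invented entries, for illustration only.
    {'status': 200, 'duration': 120, 'url': 'https://example.com/a'},
    {'status': 404, 'duration': 80, 'url': 'https://example.com/missing'},
]
counts = status_counts(sample)
print(counts)
```

With a real job, `status_counts(job.requests.iter())` would be the intended call.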

Project activity log

Project.activity provides a convenient interface to project activity events.

To retrieve activity events from a project, you can use .activity.iter(), with optional arguments (here, the last 3 events, with timestamp information):

>>> list(project.activity.iter(count=3, meta="_ts"))
[{u'_ts': 1493362000130,
  u'event': u'job:completed',
  u'job': u'123456/3/161',
  u'user': u'jobrunner'},
 {u'_ts': 1493361946077,
  u'event': u'job:started',
  u'job': u'123456/3/161',
  u'user': u'jobrunner'},
 {u'_ts': 1493361942440,
  u'event': u'job:scheduled',
  u'job': u'123456/3/161',
  u'user': u'periodicjobs'}]
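The events above arrive newest first. To reconstruct the lifecycle of a single job, they can be filtered and sorted locally, as in this sketch. `job_timeline` is a hypothetical helper, and it relies on the '_ts' field, which appears in the output only when it is requested through the meta argument.

```python
def job_timeline(events, job_key):
    """Return the event names recorded for one job, oldest first."""
    own = [event for event in events if event.get('job') == job_key]
    # Sort on '_ts'; this requires the events to have been fetched with meta="_ts".
    own.sort(key=lambda event: event['_ts'])
    return [event['event'] for event in own]


# Sample events copied from the output above.
sample_events = [
    {'_ts': 1493362000130, 'event': 'job:completed', 'job': '123456/3/161', 'user': 'jobrunner'},
    {'_ts': 1493361946077, 'event': 'job:started', 'job': '123456/3/161', 'user': 'jobrunner'},
    {'_ts': 1493361942440, 'event': 'job:scheduled', 'job': '123456/3/161', 'user': 'periodicjobs'},
]
timeline = job_timeline(sample_events, '123456/3/161')
print(timeline)
```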

To retrieve all the events, use .activity.list():

>>> project.activity.list()
[{'event': 'job:completed', 'job': '123/2/3', 'user': 'jobrunner'},
 {'event': 'job:cancelled', 'job': '123/2/3', 'user': 'john'}]

To post a new activity event, use .activity.add():

>>> event = {'event': 'job:completed', 'job': '123/2/4', 'user': 'john'}
>>> project.activity.add(event)

Or post multiple events at once:

>>> events = [
...     {'event': 'job:completed', 'job': '123/2/5', 'user': 'john'},
...     {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'},
... ]
>>> project.activity.add(events)

Collections

Scrapinghub’s Collections provide a way to store an arbitrary number of records indexed by a key. They’re often used by Scrapinghub projects as a single place to write information from multiple scraping jobs.

Read more about Collections in the official docs.

As an example, let’s store a hash and timestamp pair for spider ‘foo’.

The usual workflow with project.collections would be:

reference your project’s collections attribute,
call .get_store(<somename>) to create or access the named collection you want (the collection will be created automatically if it doesn’t exist); you get a “store” object back,
call .set(<key/value pairs>) to store data.

>>> collections = project.collections
>>> foo_store = collections.get_store('foo_store')
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'})
>>> foo_store.count()
1
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7')
{u'value': u'1447221694537'}
>>> # iterate over _key & value pair
... list(foo_store.iter())
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> # filter by multiple keys - only values for keys that exist will be returned
... list(foo_store.iter(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']))
[{u'_key': u'002d050ee3ff6192dcbecc4e4b4457d7', u'value': u'1447221694537'}]
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
>>> foo_store.count()
0
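The record passed to .set() is a plain dict with a '_key' entry. One way to build such hash and timestamp pairs is sketched below. Using an MD5 hex digest of a URL as the key is an assumption made for illustration (it merely matches the 32-character look of the key above), and `make_record` is a hypothetical helper.

```python
import hashlib
import time


def make_record(url, timestamp_ms=None):
    """Build a {'_key': ..., 'value': ...} dict for a collection store."""
    if timestamp_ms is None:
        # Default to the current time in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    # Assumption: an MD5 hex digest of the URL is an acceptable record key.
    key = hashlib.md5(url.encode('utf-8')).hexdigest()
    return {'_key': key, 'value': str(timestamp_ms)}


record = make_record('https://example.com/page', 1447221694537)
print(record['value'], len(record['_key']))
```

The resulting dict could then be stored with `foo_store.set(record)`.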

Collections are available at project level only.

Frontiers

Typical workflow with Frontiers:

>>> frontiers = project.frontiers

Get an iterator over all frontiers of a project:

>>> frontiers.iter()
<list_iterator at 0x103c93630>

List all frontiers:

>>> frontiers.list()
['test', 'test1', 'test2']

Get a Frontier instance by name:

>>> frontier = frontiers.get('test')
>>> frontier
<scrapinghub.client.Frontier at 0x1048ae4a8>

Get an iterator over a frontier’s slots:

>>> frontier.iter()
<list_iterator at 0x1030736d8>

List all slots:

>>> frontier.list()
['example.com', 'example.com2']

Get a FrontierSlot by name:

>>> slot = frontier.get('example.com')
>>> slot
<scrapinghub.client.FrontierSlot at 0x1049d8978>

Add a request to the slot:

>>> slot.queue.add([{'fp': '/some/path.html'}])
>>> slot.flush()
>>> slot.newcount
1

newcount is defined per slot, but also available per frontier and globally:

>>> frontier.newcount
1
>>> frontiers.newcount
3

Add only fingerprints to the slot:

>>> slot.fingerprints.add(['fp1', 'fp2'])
>>> slot.flush()

There are convenient shortcuts: f for fingerprints to access FrontierSlotFingerprints and q for queue to access FrontierSlotQueue.

Add requests with additional parameters:

>>> slot.q.add([{'fp': '/'}, {'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}])
>>> slot.flush()

To retrieve all requests for a given slot:

>>> reqs = slot.q.iter()

To retrieve all fingerprints for a given slot:

>>> fps = slot.f.iter()

To list all the requests, use the list() method (fingerprints work the same way):

>>> reqs = slot.q.list()

To delete a batch of requests:

>>> slot.q.delete('00013967d8af7b0001')

To delete the whole slot from the frontier:

>>> slot.delete()

Flush data of the given frontier:

>>> frontier.flush()

Flush data of all frontiers of a project:

>>> frontiers.flush()

Close batch writers of all frontiers of a project:

>>> frontiers.close()

Frontiers are available at project level only.

Settings

You can work with project settings via Settings.

To get a list of the project settings:

>>> project.settings.list()
[(u'default_job_units', 2), (u'job_runtime_limit', 24)]

To get a project setting value by name:

>>> project.settings.get('job_runtime_limit')
24

To update a project setting value by name:

>>> project.settings.set('job_runtime_limit', 20)

Or update a few project settings at once:

>>> project.settings.update({'default_job_units': 1,
...                          'job_runtime_limit': 20})

Exceptions

exception scrapinghub.ScrapinghubAPIError(message=None, http_error=None): Base exception class.

exception scrapinghub.BadRequest(message=None, http_error=None): Usually raised when the API returns a 400 response.

exception scrapinghub.Unauthorized(message=None, http_error=None): Request lacks valid authentication credentials for the target resource.

exception scrapinghub.NotFound(message=None, http_error=None): Entity doesn’t exist (e.g. spider or project).

exception scrapinghub.ValueTooLarge(message=None, http_error=None): Value cannot be written because it exceeds size limits.

exception scrapinghub.DuplicateJobError(message=None, http_error=None): Job for given spider with given arguments is already scheduled or running.

exception scrapinghub.ServerError(message=None, http_error=None): Indicates a server error: something unexpected has happened.
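As a rough sketch of how these exceptions might be handled, the helper below turns a NotFound error into a None result. `get_or_none` is a hypothetical name, and the fallback classes exist only so the example runs without the library installed; they mirror the documented names but are not the real implementations.

```python
try:
    # Exception names as documented above.
    from scrapinghub import NotFound, ScrapinghubAPIError
except ImportError:
    # Stand-ins so the sketch runs without the library installed.
    class ScrapinghubAPIError(Exception):
        pass

    class NotFound(ScrapinghubAPIError):
        pass


def get_or_none(fetch, *args, **kwargs):
    """Call fetch(*args, **kwargs); return None if the entity does not exist."""
    try:
        return fetch(*args, **kwargs)
    except NotFound:
        return None


def fetch_missing(name):
    # Simulates a lookup that fails because the entity is absent.
    raise NotFound('no such spider: %s' % name)


print(get_or_none(fetch_missing, 'spider3'))
```

With a live client, a call such as `get_or_none(project.spiders.get, 'spider3')` would be the intended usage, assuming a missing spider raises NotFound as the description above suggests. Other ScrapinghubAPIError subclasses are deliberately left to propagate.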