API Reference¶
Client object¶
- class scrapinghub.client.ScrapinghubClient(auth=None, dash_endpoint=None, connection_timeout=60, **kwargs)¶
Main class to work with the Scrapinghub API.
Parameters:
- auth – (optional) Scrapinghub APIKEY or other SH auth credentials. If not provided, it is read from the SH_APIKEY or SHUB_JOBAUTH environment variables, in that order. SHUB_JOBAUTH is available by default in Scrapy Cloud, but it does not provide access to all endpoints (e.g. job scheduling); it does allow access to job data, collections and the crawl frontier. If you need full access to Scrapy Cloud features, provide a Scrapinghub APIKEY through this argument or deploy it as SH_APIKEY.
- dash_endpoint – (optional) Scrapinghub Dash panel url.
- **kwargs – (optional) Additional arguments for the HubstorageClient constructor.
Variables: projects – projects collection, a Projects instance.
Usage:
>>> from scrapinghub import ScrapinghubClient
>>> client = ScrapinghubClient('APIKEY')
>>> client
<scrapinghub.client.ScrapinghubClient at 0x1047af2e8>
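If no APIKEY is passed explicitly, the client falls back to the environment variables described above. A minimal sketch (the key value is illustrative; in practice you would export it from the shell or let Scrapy Cloud provide it):
>>> import os
>>> os.environ.setdefault('SH_APIKEY', 'APIKEY')  # normally exported outside the code
'APIKEY'
>>> client = ScrapinghubClient()  # reads SH_APIKEY (or SHUB_JOBAUTH) automatically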
- close(timeout=None)¶
Close the client instance.
Parameters: timeout – (optional) float timeout secs to stop gracefully.
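For example, to stop the client gracefully while allowing its writer threads up to ten seconds to finish (the timeout value is illustrative):
>>> client.close(timeout=10.0)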
- get_job(job_key)¶
Get a Job with a given job key.
Parameters: job_key – job key string in the format project_id/spider_id/job_id, where all the components are integers.
Returns: a job instance. Return type: Job
Usage:
>>> job = client.get_job('123/1/1') >>> job <scrapinghub.client.jobs.Job at 0x10afe2eb1>
- get_project(project_id)¶
Get a scrapinghub.client.projects.Project instance with a given project id.
The method is a shortcut for client.projects.get().
Parameters: project_id – integer or string numeric project id. Returns: a project instance. Return type: Project
Usage:
>>> project = client.get_project(123) >>> project <scrapinghub.client.projects.Project at 0x106cdd6a0>
Activity¶
- class scrapinghub.client.activity.Activity(cls, client, key)¶
Representation of a collection of job activity events.
Not a public constructor: use a Project instance to get an Activity instance. See the activity attribute.
Please note that the list() method can use a lot of memory; for a large amount of activities it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
Usage:
get all activity from a project:
>>> project.activity.iter() <generator object jldecode at 0x1049ee990>
get only last 2 events from a project:
>>> project.activity.list(count=2) [{'event': 'job:completed', 'job': '123/2/3', 'user': 'jobrunner'}, {'event': 'job:started', 'job': '123/2/3', 'user': 'john'}]
post a new event:
>>> event = {'event': 'job:completed', ... 'job': '123/2/4', ... 'user': 'jobrunner'} >>> project.activity.add(event)
post multiple events at once:
>>> events = [ ... {'event': 'job:completed', 'job': '123/2/5', 'user': 'jobrunner'}, ... {'event': 'job:cancelled', 'job': '123/2/6', 'user': 'john'}, ... ] >>> project.activity.add(events)
- add(values, **kwargs)¶
Add a new event to the project activity.
Parameters: values – a single event or a list of events, where an event is represented with a dictionary of ('event', 'job', 'user') keys.
- iter(count=None, **params)¶
Iterate over activity events.
Parameters: count – limit amount of elements. Returns: a generator object over a list of activity event dicts. Return type: types.GeneratorType[dict]
Collections¶
- class scrapinghub.client.collections.Collection(client, collections, type_, name)¶
Representation of a project collection object.
Not a public constructor: use a Collections instance to get a Collection instance. See Collections.get_store() and similar methods.
Usage:
add a new item to collection:
>>> foo_store.set({'_key': '002d050ee3ff6192dcbecc4e4b4457d7', ... 'value': '1447221694537'})
count items in collection:
>>> foo_store.count() 1
get an item from collection:
>>> foo_store.get('002d050ee3ff6192dcbecc4e4b4457d7') {'value': '1447221694537'}
get all items from collection:
>>> foo_store.iter() <generator object jldecode at 0x1049eef10>
iterate over _key & value pair:
>>> for elem in foo_store.iter(count=1):
...     print(elem)
{'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}
get generator over item keys:
>>> keys = foo_store.iter(nodata=True, meta=["_key"])
>>> next(keys)
{'_key': '002d050ee3ff6192dcbecc4e4b4457d7'}
filter by multiple keys, only values for keys that exist will be returned:
>>> foo_store.list(key=['002d050ee3ff6192dcbecc4e4b4457d7', 'blah']) [{'_key': '002d050ee3ff6192dcbecc4e4b4457d7', 'value': '1447221694537'}]
delete an item by key:
>>> foo_store.delete('002d050ee3ff6192dcbecc4e4b4457d7')
remove the entire collection with a single API call:
>>> foo_store.truncate()
- count(*args, **kwargs)¶
Count collection items with the given filters.
Returns: amount of elements in collection. Return type: int
- create_writer(start=0, auth=None, size=1000, interval=15, qsize=None, content_encoding='identity', maxitemsize=1048576, callback=None)¶
Create a new writer for a collection.
Parameters: - start – (optional) initial offset for writer thread.
- auth – (optional) set auth credentials for the request.
- size – (optional) set initial queue size.
- interval – (optional) set interval for writer thread.
- qsize – (optional) setup max queue size for the writer.
- content_encoding – (optional) set different Content-Encoding header.
- maxitemsize – (optional) max item size in bytes.
- callback – (optional) some callback function.
Returns: a new writer object.
Return type: scrapinghub.hubstorage.batchuploader._BatchWriter
If provided, the callback shouldn't try to inject more items into the queue, otherwise it can lead to deadlocks.
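A minimal sketch of batch-writing items through a writer; this assumes the returned writer exposes write(), flush() and close() as in the Hubstorage batch uploader, and the item values are illustrative:
>>> writer = foo_store.create_writer(size=500, interval=30)
>>> for i in range(3):
...     writer.write({'_key': 'item-%d' % i, 'value': str(i)})  # items are queued and uploaded in batches
>>> writer.flush()  # force pending items to be uploaded
>>> writer.close()  # stop the writer thread when done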
- delete(keys)¶
Delete item(s) from the collection by key(s).
Parameters: keys – a single key or a list of keys.
The method returns None (the original method returns an empty generator).
- get(key, **params)¶
Get an item from the collection by key.
Parameters: - key – string item key.
- **params – (optional) additional query params for the request.
Returns: an item dictionary if it exists.
Return type: dict
- iter(key=None, prefix=None, prefixcount=None, startts=None, endts=None, requests_params=None, **params)¶
A method to iterate through collection items.
Parameters: - key – a string key or a list of keys to filter with.
- prefix – a string prefix to filter items.
- prefixcount – maximum number of values to return per prefix.
- startts – UNIX timestamp at which to begin results.
- endts – UNIX timestamp at which to end results.
- requests_params – (optional) a dict with optional requests params.
- **params – (optional) additional query params for the request.
Returns: an iterator over items list.
Return type: collections.Iterable[dict]
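For instance, iterating with a key prefix or a time window could look like this (the prefix and timestamps are illustrative):
>>> for item in foo_store.iter(prefix='002d', count=10):
...     print(item['_key'])
002d050ee3ff6192dcbecc4e4b4457d7
>>> recent = foo_store.list(startts=1447221694000, endts=1447221695000)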
- list(key=None, prefix=None, prefixcount=None, startts=None, endts=None, requests_params=None, **params)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of items it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
Parameters:
- key – a string key or a list of keys to filter with.
- prefix – a string prefix to filter items.
- prefixcount – maximum number of values to return per prefix.
- startts – UNIX timestamp at which to begin results.
- endts – UNIX timestamp at which to end results.
- requests_params – (optional) a dict with optional requests params.
- **params – (optional) additional query params for the request.
Returns: a list of items where each item is represented with a dict.
Return type: list[dict]
- set(value)¶
Set an item in the collection by key.
Parameters: value – a dict representing a collection item.
The method returns None (the original method returns an empty generator).
- truncate()¶
Remove the entire collection with a single API call.
The method returns None (the original method returns an empty generator).
- class scrapinghub.client.collections.Collections(cls, client, key)¶
Access to project collections.
Not a public constructor: use a Project instance to get a Collections instance. See the collections attribute.
Usage:
>>> collections = project.collections >>> collections.list() [{'name': 'Pages', 'type': 's'}] >>> foo_store = collections.get_store('foo_store')
- get(type_, name)¶
Base method to get a collection with a given type and name.
Parameters: - type_ – a collection type string.
- name – a collection name string.
Returns: a collection object.
Return type: Collection
- get_cached_store(name)¶
Method to get a cached-store collection by name.
Items in this collection type expire after a month.
Parameters: name – a collection name string. Returns: a collection object. Return type: Collection
- get_store(name)¶
Method to get a store collection by name.
Parameters: name – a collection name string. Returns: a collection object. Return type: Collection
- get_versioned_cached_store(name)¶
Method to get a versioned-cached-store collection by name.
Multiple copies are retained, and each one expires after a month.
Parameters: name – a collection name string. Returns: a collection object. Return type: Collection
- get_versioned_store(name)¶
Method to get a versioned-store collection by name.
This collection type retains up to 3 copies of each item.
Parameters: name – a collection name string. Returns: a collection object. Return type: Collection
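The different store types are created the same way and differ only in retention semantics; a short sketch with illustrative collection names:
>>> pages = collections.get_store('pages')                       # plain key/value store
>>> cached = collections.get_cached_store('pages_cache')         # items expire after a month
>>> versioned = collections.get_versioned_store('pages_history') # keeps up to 3 copies per item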
- iter()¶
Iterate through the collections of a project.
Returns: an iterator over the collections list where each collection is represented by a dictionary with ('name', 'type') fields. Return type: collections.Iterable[dict]
- list()¶
List the collections of a project.
Returns: a list of collections where each collection is represented by a dictionary with (‘name’,’type’) fields. Return type: list[dict]
Exceptions¶
- exception scrapinghub.client.exceptions.BadRequest(message=None, http_error=None)¶
Usually raised in case of a 400 response from the API.
- exception scrapinghub.client.exceptions.DuplicateJobError(message=None, http_error=None)¶
A job for the given spider with the given arguments is already scheduled or running.
- exception scrapinghub.client.exceptions.Forbidden(message=None, http_error=None)¶
You don't have permission to access the requested resource. It is either read-protected or not readable by the server.
- exception scrapinghub.client.exceptions.NotFound(message=None, http_error=None)¶
Entity doesn't exist (e.g. spider or project).
- exception scrapinghub.client.exceptions.ScrapinghubAPIError(message=None, http_error=None)¶
Base exception class.
- exception scrapinghub.client.exceptions.ServerError(message=None, http_error=None)¶
Indicates some server error: something unexpected has happened.
- exception scrapinghub.client.exceptions.Unauthorized(message=None, http_error=None)¶
Request lacks valid authentication credentials for the target resource.
- exception scrapinghub.client.exceptions.ValueTooLarge(message=None, http_error=None)¶
Value cannot be written because it exceeds size limits.
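The exceptions above share ScrapinghubAPIError as a base class, so you can catch a specific error or fall back to the base one; a minimal sketch (the project id and spider name are illustrative):
>>> from scrapinghub.client.exceptions import NotFound, ScrapinghubAPIError
>>> try:
...     spider = client.get_project(123).spiders.get('non-existing')
... except NotFound as exc:
...     print('no such spider:', exc)
... except ScrapinghubAPIError as exc:
...     print('other API error:', exc)
no such spider: Spider non-existing doesn't exist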
Frontiers¶
- class scrapinghub.client.frontiers.Frontier(client, frontiers, name)¶
Representation of a frontier object.
Not a public constructor: use a Frontiers instance to get a Frontier instance. See the Frontiers.get() method.
Usage:
get iterator with all slots:
>>> frontier.iter() <list_iterator at 0x1030736d8>
list all slots:
>>> frontier.list() ['example.com', 'example.com2']
get a slot by name:
>>> frontier.get('example.com') <scrapinghub.client.frontiers.FrontierSlot at 0x1049d8978>
flush frontier data:
>>> frontier.flush()
show amount of new requests added to frontier:
>>> frontier.newcount 3
- flush()¶
Flush data for the whole frontier.
- get(slot)¶
Get a slot by name.
Returns: a frontier slot instance. Return type: FrontierSlot
- iter()¶
Iterate through slots.
Returns: an iterator over frontier slots names. Return type: collections.Iterable[str]
- list()¶
List all slots.
Returns: a list of frontier slots names. Return type: list[str]
- newcount¶
Integer amount of new entries added to the frontier.
- class scrapinghub.client.frontiers.FrontierSlot(client, frontier, slot)¶
Representation of a frontier slot object.
Not a public constructor: use a Frontier instance to get a FrontierSlot instance. See the Frontier.get() method.
Usage:
add request to a queue:
>>> data = [{'fp': 'page1.html', 'p': 1, 'qdata': {'depth': 1}}] >>> slot.q.add('example.com', data)
add fingerprints to a slot:
>>> slot.f.add(['fp1', 'fp2'])
flush data for a slot:
>>> slot.flush()
show amount of new requests added to a slot:
>>> slot.newcount 2
read requests from a slot:
>>> slot.q.iter() <generator object jldecode at 0x1049aa9e8> >>> slot.q.list() [{'id': '0115a8579633600006', 'requests': [['page1.html', {'depth': 1}]]}]
read fingerprints from a slot:
>>> slot.f.iter() <generator object jldecode at 0x103de4938> >>> slot.f.list() ['page1.html']
delete a batch with requests from a slot:
>>> slot.q.delete('0115a8579633600006')
delete a whole slot:
>>> slot.delete()
- delete()¶
Delete the slot.
- f¶
Shortcut for quick access to the slot fingerprints.
Returns: fingerprints collection for the slot. Return type: FrontierSlotFingerprints
- flush()¶
Flush data for the slot.
- newcount¶
Integer amount of new entries added to the slot.
- q¶
Shortcut for quick access to the slot queue.
Returns: queue instance for the slot. Return type: FrontierSlotQueue
- class scrapinghub.client.frontiers.FrontierSlotFingerprints(slot)¶
Representation of the request fingerprints collection stored in a slot.
- add(fps)¶
Add new fingerprints to the slot.
Parameters: fps – a list of string fingerprints to add.
- iter(**params)¶
Iterate through fingerprints in the slot.
Parameters: **params – (optional) additional query params for the request. Returns: an iterator over fingerprints. Return type: collections.Iterable[str]
- list(**params)¶
List fingerprints in the slot.
Parameters: **params – (optional) additional query params for the request. Returns: a list of fingerprints. Return type: list[str]
- class scrapinghub.client.frontiers.FrontierSlotQueue(slot)¶
Representation of the request batches queue stored in a slot.
- add(fps)¶
Add requests to the queue.
- delete(ids)¶
Delete request batches from the queue.
- iter(mincount=None, **params)¶
Iterate through batches in the queue.
Parameters: - mincount – (optional) limit results with min amount of requests.
- **params – (optional) additional query params for the request.
Returns: an iterator over request batches in the queue where each batch is represented with a dict with (‘id’, ‘requests’) field.
Return type: collections.Iterable[dict]
- list(mincount=None, **params)¶
List request batches in the queue.
Parameters: - mincount – (optional) limit results with min amount of requests.
- **params – (optional) additional query params for the request.
Returns: a list of request batches in the queue where each batch is represented with a dict with (‘id’, ‘requests’) field.
Return type: list[dict]
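A common consumption pattern is to read batches from the queue, process their requests, and then delete the processed batches; a minimal sketch (the slot name and processing logic are illustrative):
>>> slot = frontier.get('example.com')
>>> for batch in slot.q.iter(mincount=100):
...     for fp, qdata in batch['requests']:
...         pass  # fetch / process the request identified by fp here
...     slot.q.delete(batch['id'])  # acknowledge the processed batch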
- class scrapinghub.client.frontiers.Frontiers(*args, **kwargs)¶
Frontiers collection for a project.
Not a public constructor: use a Project instance to get a Frontiers instance. See the frontiers attribute.
Usage:
get all frontiers from a project:
>>> project.frontiers.iter() <list_iterator at 0x103c93630>
list all frontiers:
>>> project.frontiers.list() ['test', 'test1', 'test2']
get a frontier by name:
>>> project.frontiers.get('test') <scrapinghub.client.frontiers.Frontier at 0x1048ae4a8>
flush data of all frontiers of a project:
>>> project.frontiers.flush()
show amount of new requests added for all frontiers:
>>> project.frontiers.newcount 3
close batch writers of all frontiers of a project:
>>> project.frontiers.close()
- close()¶
Close frontier writer threads one-by-one.
- flush()¶
Flush data in all frontiers writer threads.
- get(name)¶
Get a frontier by name.
Parameters: name – a frontier name string. Returns: a frontier instance. Return type: Frontier
- iter()¶
Iterate through frontiers.
Returns: an iterator over frontiers names. Return type: collections.Iterable[str]
- list()¶
List frontier names.
Returns: a list of frontiers names. Return type: list[str]
- newcount¶
Integer amount of new entries added to all frontiers.
Items¶
- class scrapinghub.client.items.Items(cls, client, key)¶
Representation of a collection of job items.
Not a public constructor: use a Job instance to get an Items instance. See the items attribute.
Please note that the list() method can use a lot of memory; for a large number of items it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
Usage:
retrieve all scraped items from a job:
>>> job.items.iter() <generator object mpdecode at 0x10f5f3aa0>
iterate through first 100 items and print them:
>>> for item in job.items.iter(count=100): ... print(item)
retrieve items with timestamp greater or equal to given timestamp (item here is an arbitrary dictionary depending on your code):
>>> job.items.list(startts=1447221694537) [{ 'name': ['Some custom item'], 'url': 'http://some-url/item.html', 'size': 100000, }]
retrieve items via a generator of lists. This is most useful in cases where the job has a huge amount of items and it needs to be broken down into chunks when consumed. This example shows a job with 3 items:
>>> gen = job.items.list_iter(chunksize=2) >>> next(gen) [{'name': 'Item #1'}, {'name': 'Item #2'}] >>> next(gen) [{'name': 'Item #3'}] >>> next(gen) Traceback (most recent call last): File "<stdin>", line 1, in <module> StopIteration
retrieving via list_iter() also supports the start and count params. This is useful when you want to retrieve only a subset of items in a job. The example below belongs to a job with 10 items:
>>> gen = job.items.list_iter(chunksize=2, start=5, count=3) >>> next(gen) [{'name': 'Item #5'}, {'name': 'Item #6'}] >>> next(gen) [{'name': 'Item #7'}] >>> next(gen) Traceback (most recent call last): File "<stdin>", line 1, in <module> StopIteration
retrieve 1 item with multiple filters:
>>> filters = [("size", ">", [30000]), ("size", "<", [40000])] >>> job.items.list(count=1, filter=filters) [{ 'name': ['Some other item'], 'url': 'http://some-url/other-item.html', 'size': 35000, }]
- close(block=True)¶
Close writers one-by-one.
- flush()¶
Flush data from the writer threads.
- get(key, **params)¶
Get an element from the collection.
Parameters: key – element key. Returns: a dictionary with element data. Return type: dict
- iter(_path=None, count=None, requests_params=None, **apiparams)¶
A general method to iterate through elements.
Parameters: count – limit amount of elements. Returns: an iterator over elements list. Return type: collections.Iterable
- list(*args, **kwargs)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of elements it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- list_iter(chunksize=1000, *args, **kwargs)¶
An alternative interface for reading items: returns a generator which yields lists of items, each sized as chunksize.
This is convenient when processing all items from a job in one go isn't ideal because of the memory needed; instead it allows you to process them chunk by chunk.
You can reduce I/O overhead by increasing the chunk size, but that also increases memory consumption.
Parameters: - chunksize – size of list to be returned per iteration
- start – offset to specify the start of the item iteration
- count – overall number of items to be returned, which is broken down by chunksize.
Returns: an iterator over items, yielding lists of items.
Return type: collections.Iterable
- stats()¶
Get resource stats.
Returns: a dictionary with stats data. Return type: dict
- write(item)¶
Write a new element to the collection.
Parameters: item – element data dict to write.
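Writing items is typically only done from within a running job (e.g. a script running on Scrapy Cloud), since the writer posts to the job's own item set; a minimal sketch with an illustrative item:
>>> job.items.write({'name': 'Some item', 'url': 'http://some-url/item.html'})
>>> job.items.flush()  # make sure buffered items are sent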
Jobs¶
- class scrapinghub.client.jobs.Job(client, job_key)¶
Class representing a job object.
Not a public constructor: use a ScrapinghubClient instance or a Jobs instance to get a Job instance. See the scrapinghub.client.ScrapinghubClient.get_job() and Jobs.get() methods.
Usage:
>>> job = project.jobs.get('123/1/2') >>> job.key '123/1/2' >>> job.metadata.get('state') 'finished'
- cancel()¶
Schedule a running job for cancellation.
Usage:
>>> job.cancel() >>> job.metadata.get('cancelled_by') 'John'
- close_writers()¶
Stop the job batch writer threads gracefully.
Called by the ScrapinghubClient.close() method.
- delete(**params)¶
Mark a finished job for deletion.
Parameters: **params – (optional) keyword meta parameters to update. Returns: a previous string job state. Return type: str
Usage:
>>> job.delete() 'finished'
- finish(**params)¶
Move a running job to the finished state.
Parameters: **params – (optional) keyword meta parameters to update. Returns: a previous string job state. Return type: str
Usage:
>>> job.finish() 'running'
- start(**params)¶
Move a job to the running state.
Parameters: **params – (optional) keyword meta parameters to update. Returns: a previous string job state. Return type: str
Usage:
>>> job.start() 'pending'
- update(state, **params)¶
Update the job state.
Parameters: - state – a new job state.
- **params – (optional) keyword meta parameters to update.
Returns: a previous string job state.
Return type: str
Usage:
>>> job.update('finished') 'running'
- update_tags(add=None, remove=None)¶
Partially update job tags.
It provides a convenient way to mark specific jobs (for better search, postprocessing, etc.).
Parameters:
- add – (optional) list of tags to add.
- remove – (optional) list of tags to remove.
Usage: to mark a job with the tag consumed:
>>> job.update_tags(add=['consumed'])
- class scrapinghub.client.jobs.JobMeta(cls, client, key)¶
Class representing job metadata.
Not a public constructor: use a Job instance to get a JobMeta instance. See the metadata attribute.
Usage:
get job metadata instance:
>>> job.metadata <scrapinghub.client.jobs.JobMeta at 0x10494f198>
iterate through job metadata:
>>> job.metadata.iter() <dict_itemiterator at 0x104adbd18>
list job metadata:
>>> job.metadata.list() [('project', 123), ('units', 1), ('state', 'finished'), ...]
get meta field value by name:
>>> job.metadata.get('version') 'test'
update job meta field value (some meta fields are read-only):
>>> job.metadata.set('my-meta', 'test')
update multiple meta fields at once
>>> job.metadata.update({'my-meta1': 'test1', 'my-meta2': 'test2'})
delete meta field by name:
>>> job.metadata.delete('my-meta')
- delete(key)¶
Delete an element by key.
Parameters: key – a string key.
- get(key)¶
Get an element value by key.
Parameters: key – a string key.
- iter()¶
Iterate through key/value pairs.
Returns: an iterator over key/value pairs. Return type: collections.Iterable
- list(*args, **kwargs)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of elements it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- set(key, value)¶
Set an element value.
Parameters: - key – a string key
- value – new value to set for the key
- update(values)¶
Update multiple elements at once.
The method provides a convenient interface for partial updates.
Parameters: values – a dictionary with key/values to update.
- class scrapinghub.client.jobs.Jobs(client, project_id, spider=None)¶
Class representing a collection of jobs for a project/spider.
Not a public constructor: use a Project instance or a Spider instance to get a Jobs instance. See the scrapinghub.client.projects.Project.jobs and scrapinghub.client.spiders.Spider.jobs attributes.
Variables:
- project_id – a string project id.
- spider – a Spider object, if defined.
Usage:
>>> project.jobs <scrapinghub.client.jobs.Jobs at 0x10477f0b8> >>> spider = project.spiders.get('spider1') >>> spider.jobs <scrapinghub.client.jobs.Jobs at 0x104767e80>
- cancel(keys=None, count=None, **params)¶
Cancel a list of jobs using the keys provided.
Parameters:
- keys – (optional) a list of strings containing the job keys in the format: <project>/<spider>/<job_id>.
- count – (optional) requires admin access; used by admins to bulk-cancel an amount of count jobs.
Returns: a dict with the amount of jobs cancelled.
Return type: dict
Usage:
cancel jobs 123 and 321 from project 111 and spiders 222 and 333:
>>> project.jobs.cancel(['111/222/123', '111/333/321']) {'count': 2}
cancel 100 jobs asynchronously:
>>> project.jobs.cancel(count=100) {'count': 100}
- count(spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, **params)¶
Count jobs with a given set of filters.
Parameters: - spider – (optional) filter by spider name.
- state – (optional) a job state, a string or a list of strings.
- has_tag – (optional) filter results by existing tag(s), a string or a list of strings.
- lacks_tag – (optional) filter results by missing tag(s), a string or a list of strings.
- startts – (optional) UNIX timestamp at which to begin results, in milliseconds.
- endts – (optional) UNIX timestamp at which to end results, in milliseconds.
- **params – (optional) other filter params.
Returns: jobs count.
Return type: int
The endpoint used by the method counts only finished jobs by default; use the state parameter to count jobs in other states.
Usage:
>>> spider = project.spiders.get('spider1') >>> spider.jobs.count() 5 >>> project.jobs.count(spider='spider2', state='finished') 2
- get(job_key)¶
Get a Job with a given job_key.
Parameters: job_key – a string job key. The job_key's project component should match the project used to get the Jobs instance, and the job_key's spider component should match the spider (if a Spider was used to get the Jobs instance).
Returns: a job object. Return type: Job
Usage:
>>> job = project.jobs.get('123/1/2') >>> job.key '123/1/2'
- iter(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params)¶
Iterate over the jobs collection for a given set of params.
Parameters: - count – (optional) limit amount of returned jobs.
- start – (optional) number of jobs to skip in the beginning.
- spider – (optional) filter by spider name.
- state – (optional) a job state, a string or a list of strings.
- has_tag – (optional) filter results by existing tag(s), a string or a list of strings.
- lacks_tag – (optional) filter results by missing tag(s), a string or a list of strings.
- startts – (optional) UNIX timestamp at which to begin results, in milliseconds.
- endts – (optional) UNIX timestamp at which to end results, in milliseconds.
- meta – (optional) request for additional fields, a single field name or a list of field names to return.
- **params – (optional) other filter params.
Returns: a generator object over a list of dictionaries of jobs summary for a given filter params.
Return type: types.GeneratorType[dict]
The endpoint used by the method returns only finished jobs by default; use the state parameter to return jobs in other states.
Usage:
retrieve all jobs for a spider:
>>> spider.jobs.iter() <generator object jldecode at 0x1049bd570>
get all job keys for a spider:
>>> jobs_summary = spider.jobs.iter() >>> [job['key'] for job in jobs_summary] ['123/1/3', '123/1/2', '123/1/1']
the job summary fieldset is less detailed than JobMeta but contains a few new fields as well. Additional fields can be requested using the meta parameter. If it is used, it's up to the user to list all the required fields, so only a few default fields are added besides the requested ones:
>>> jobs_summary = project.jobs.iter(meta=['scheduled_by', ])
by default Jobs.iter() returns at most the last 1000 results. Pagination is available using the start parameter:
>>> jobs_summary = spider.jobs.iter(start=1000)
get jobs filtered by tags (a list of tags is combined with OR semantics):
>>> jobs_summary = project.jobs.iter(
...     has_tag=['new', 'verified'], lacks_tag='obsolete')
get a certain number of the last finished jobs for a given spider:
>>> jobs_summary = project.jobs.iter( ... spider='spider2', state='finished', count=3)
- iter_last(start=None, start_after=None, count=None, spider=None, **params)¶
Iterate through the last jobs for each spider.
Parameters:
- start – (optional)
- start_after – (optional)
- count – (optional)
- spider – (optional) a spider name (not needed if instantiated with Spider).
- **params – (optional) additional keyword args.
Returns: a generator object over a list of dictionaries of jobs summary for a given filter params.
Return type: types.GeneratorType[dict]
Usage:
get all last job summaries for a project:
>>> project.jobs.iter_last() <generator object jldecode at 0x1048a95c8>
get the last job summary for a spider:
>>> list(spider.jobs.iter_last()) [{'close_reason': 'success', 'elapsed': 3062444, 'errors': 1, 'finished_time': 1482911633089, 'key': '123/1/3', 'logs': 8, 'pending_time': 1482911596566, 'running_time': 1482911598909, 'spider': 'spider1', 'state': 'finished', 'ts': 1482911615830, 'version': 'some-version'}]
- list(count=None, start=None, spider=None, state=None, has_tag=None, lacks_tag=None, startts=None, endts=None, meta=None, **params)¶
Convenient shortcut to list iter() results.
Parameters: - count – (optional) limit amount of returned jobs.
- start – (optional) number of jobs to skip in the beginning.
- spider – (optional) filter by spider name.
- state – (optional) a job state, a string or a list of strings.
- has_tag – (optional) filter results by existing tag(s), a string or a list of strings.
- lacks_tag – (optional) filter results by missing tag(s), a string or a list of strings.
- startts – (optional) UNIX timestamp at which to begin results, in milliseconds.
- endts – (optional) UNIX timestamp at which to end results, in milliseconds.
- meta – (optional) request for additional fields, a single field name or a list of field names to return.
- **params – (optional) other filter params.
Returns: list of dictionaries of jobs summary for a given filter params.
Return type: list[dict]
The endpoint used by the method returns only finished jobs by default; use the state parameter to return jobs in other states.
Please note that list() can use a lot of memory; for a large amount of jobs it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- run(spider=None, units=None, priority=None, meta=None, add_tag=None, job_args=None, job_settings=None, cmd_args=None, environment=None, **params)¶
Schedule a new job and return its job key.
Parameters:
- spider – a spider name string (not needed if the job is scheduled via Spider.jobs).
- units – (optional) amount of units for the job.
- priority – (optional) integer priority value.
- meta – (optional) a dictionary with metadata.
- add_tag – (optional) a string tag or a list of tags to add.
- job_args – (optional) a dictionary with job arguments.
- job_settings – (optional) a dictionary with job settings.
- cmd_args – (optional) a string with script command args.
- environment – (optional) a dictionary with a custom environment.
- **params – (optional) additional keyword args.
Returns: a job instance, representing the scheduled job.
Return type: Job
Usage:
>>> job = project.jobs.run('spider1', job_args={'arg1': 'val1'}) >>> job <scrapinghub.client.jobs.Job at 0x7fcb7c01df60> >>> job.key '123/1/1'
- summary(state=None, spider=None, **params)¶
Get jobs summary (optionally by state).
Parameters:
- state – (optional) a string state to filter jobs.
- spider – (optional) a spider name (not needed if instantiated with Spider).
- **params – (optional) additional keyword args.
Returns: a list of dictionaries of jobs summary for a given filter params grouped by job state.
Return type: list[dict]
Usage:
>>> spider.jobs.summary() [{'count': 0, 'name': 'pending', 'summary': []}, {'count': 0, 'name': 'running', 'summary': []}, {'count': 5, 'name': 'finished', 'summary': [...]} >>> project.jobs.summary('pending') {'count': 0, 'name': 'pending', 'summary': []}
- update_tags(add=None, remove=None, spider=None)¶
Update tags for all existing spider jobs.
Parameters:
- add – (optional) list of tags to add to the selected jobs.
- remove – (optional) list of tags to remove from the selected jobs.
- spider – (optional) spider name, required if used with Project.jobs.
It's not allowed to update tags for all project jobs, so a spider must be specified (this is done implicitly when using Spider.jobs, or you have to specify the spider param when using Project.jobs).
Returns: amount of jobs that were updated. Return type: int
Usage:
mark all spider jobs with the tag consumed:
>>> spider = project.spiders.get('spider1')
>>> spider.jobs.update_tags(add=['consumed'])
5
remove the existing tag existing for all spider jobs:
>>> project.jobs.update_tags(
...     remove=['existing'], spider='spider2')
2
Logs¶
- class scrapinghub.client.logs.Logs(cls, client, key)¶
Representation of a collection of job logs.
Not a public constructor: use a Job instance to get a Logs instance. See the logs attribute.
Please note that the list() method can use a lot of memory; for a large amount of logs it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
Usage:
retrieve all logs from a job:
>>> job.logs.iter() <generator object mpdecode at 0x10f5f3aa0>
iterate through first 100 log entries and print them:
>>> for log in job.logs.iter(count=100): ... print(log)
retrieve a single log entry from a job:
>>> job.logs.list(count=1) [{ 'level': 20, 'message': '[scrapy.core.engine] Closing spider (finished)', 'time': 1482233733976, }]
retrieve logs with a given log level and filter by a word:
>>> filters = [("message", "contains", ["mymessage"])] >>> job.logs.list(level='WARNING', filter=filters) [{ 'level': 30, 'message': 'Some warning: mymessage', 'time': 1486375511188, }]
- batch_write_start()¶
Override to set a start parameter when commencing writing.
- close(block=True)¶
Close writers one-by-one.
- debug(message, **other)¶
Log a message with DEBUG level.
- error(message, **other)¶
Log a message with ERROR level.
- flush()¶
Flush data from the writer threads.
- get(key, **params)¶
Get an element from the collection.
Parameters: key – element key. Returns: a dictionary with element data. Return type: dict
- info(message, **other)¶
Log a message with INFO level.
- iter(_path=None, count=None, requests_params=None, **apiparams)¶
A general method to iterate through elements.
Parameters: count – limit amount of elements. Returns: an iterator over elements list. Return type: collections.Iterable
- list(*args, **kwargs)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of elements it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- log(message, level=20, ts=None, **other)¶
Base method to write a log entry.
Parameters: - message – a string message.
- level – (optional) logging level, default to INFO.
- ts – (optional) UNIX timestamp in milliseconds.
- **other – other optional kwargs.
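Writing log entries is usually done from within a running job, whose writer threads post them to the API; a minimal sketch with illustrative messages:
>>> import logging
>>> job.logs.info('Parsing category pages')
>>> job.logs.error('Failed to parse a product page')
>>> job.logs.log('A custom entry', level=logging.WARNING)
>>> job.logs.flush()  # push buffered entries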
- stats()¶
Get resource stats.
Returns: a dictionary with stats data. Return type: dict
- warn(message, **other)¶
Log a message with WARN level.
- warning(message, **other)¶
Log a message with WARN level.
- write(item)¶
Write a new element to the collection.
Parameters: item – element data dict to write.
Projects¶
- class scrapinghub.client.projects.Project(client, project_id)¶
Class representing a project object and its resources.
Not a public constructor: use a ScrapinghubClient instance or a Projects instance to get a Project instance. See the scrapinghub.client.ScrapinghubClient.get_project() or Projects.get() methods.
Variables:
- key – string project id.
- activity – Activity resource object.
- collections – Collections resource object.
- frontiers – Frontiers resource object.
- jobs – Jobs resource object.
- settings – Settings resource object.
- spiders – Spiders resource object.
Usage:
>>> project = client.get_project(123) >>> project <scrapinghub.client.projects.Project at 0x106cdd6a0> >>> project.key '123'
- class scrapinghub.client.projects.Projects(client)¶
Collection of projects available to the current user.
Not a public constructor: use a ScrapinghubClient instance to get a Projects instance. See the scrapinghub.client.ScrapinghubClient.projects attribute.
Usage:
>>> client.projects <scrapinghub.client.projects.Projects at 0x1047ada58>
- get(project_id)¶
Get a project for a given project id.
Parameters: project_id – integer or string numeric project id. Returns: a project object. Return type: Project
Usage:
>>> project = client.projects.get(123) >>> project <scrapinghub.client.projects.Project at 0x106cdd6a0>
- iter()¶
Iterate through the list of projects available to the current user.
Provided for the sake of API consistency.
Returns: an iterator over project ids list. Return type: collections.Iterable[int]
- list()¶
Get the list of projects available to the current user.
Returns: a list of project ids. Return type: list[int]
Usage:
>>> client.projects.list() [123, 456]
- summary(state=None, **params)¶
Get short summaries for all available user projects.
Parameters: state – a string state or a list of states. Returns: a list of dictionaries: each dictionary represents a project summary (amount of pending/running/finished jobs and a flag if it has a capacity to run new jobs). Return type: list[dict]
Usage:
>>> client.projects.summary() [{'finished': 674, 'has_capacity': True, 'pending': 0, 'project': 123, 'running': 1}, {'finished': 33079, 'has_capacity': True, 'pending': 0, 'project': 456, 'running': 2}]
- class scrapinghub.client.projects.Settings(cls, client, key)¶
Class representing project settings.
Not a public constructor: use a Project instance to get a Settings instance. See the Project.settings attribute.
Usage:
get project settings instance:
>>> project.settings <scrapinghub.client.projects.Settings at 0x10ecf1250>
iterate through project settings:
>>> project.settings.iter() <dictionary-itemiterator at 0x10ed11578>
list project settings:
>>> project.settings.list() [(u'default_job_units', 2), (u'job_runtime_limit', 20)]
get setting value by name:
>>> project.settings.get('default_job_units') 2
update setting value (some settings are read-only):
>>> project.settings.set('default_job_units', 2)
update multiple settings at once:
>>> project.settings.update({'default_job_units': 1, ... 'job_runtime_limit': 20})
delete project setting by name:
>>> project.settings.delete('job_runtime_limit')
- delete(key)¶
Delete an element by key.
Parameters: key – a string key.
- get(key)¶
Get an element value by key.
Parameters: key – a string key.
- iter()¶
Iterate through key/value pairs.
Returns: an iterator over key/value pairs. Return type: collections.Iterable
- list(*args, **kwargs)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of elements it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- set(key, value)¶
Update a project setting value by key.
Parameters: - key – a string setting key.
- value – new setting value.
- update(values)¶
Update multiple elements at once.
The method provides a convenient interface for partial updates.
Parameters: values – a dictionary with key/values to update.
Requests¶
- class scrapinghub.client.requests.Requests(cls, client, key)¶
Representation of a collection of job requests.
Not a public constructor: use a Job instance to get a Requests instance. See the requests attribute.
Please note that the list() method can use a lot of memory; for a large amount of requests it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
Usage:
retrieve all requests from a job:
>>> job.requests.iter() <generator object mpdecode at 0x10f5f3aa0>
iterate through the requests:
>>> for reqitem in job.requests.iter(count=1): ... print(reqitem['time']) 1482233733870
retrieve single request from a job:
>>> job.requests.list(count=1)
[{
    'duration': 354,
    'fp': '6d748741a927b10454c83ac285b002cd239964ea',
    'method': 'GET',
    'rs': 1270,
    'status': 200,
    'time': 1482233733870,
    'url': 'https://example.com'
}]
- add(url, status, method, rs, duration, ts, parent=None, fp=None)¶
Add a new request.
Parameters: - url – string url for the request.
- status – HTTP status of the request.
- method – stringified request method.
- rs – response body length.
- duration – request duration in milliseconds.
- ts – UNIX timestamp in milliseconds.
- parent – (optional) parent request id.
- fp – (optional) string fingerprint for the request.
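Like items and logs, request entries are normally written from within a running job; a minimal sketch with illustrative values:
>>> job.requests.add(
...     url='https://example.com/page.html', status=200, method='GET',
...     rs=1270, duration=354, ts=1482233733870,
...     fp='6d748741a927b10454c83ac285b002cd239964ea')
>>> job.requests.flush()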
- close(block=True)¶
Close writers one-by-one.
- flush()¶
Flush data from the writer threads.
- get(key, **params)¶
Get an element from the collection.
Parameters: key – element key. Returns: a dictionary with element data. Return type: dict
- iter(_path=None, count=None, requests_params=None, **apiparams)¶
A general method to iterate through elements.
Parameters: count – limit amount of elements. Returns: an iterator over elements list. Return type: collections.Iterable
- list(*args, **kwargs)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of elements it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- stats()¶
Get resource stats.
Returns: a dictionary with stats data. Return type: dict
- write(item)¶
Write a new element to the collection.
Parameters: item – element data dict to write.
Samples¶
- class scrapinghub.client.samples.Samples(cls, client, key)¶
Representation of a collection of job samples.
Not a public constructor: use a Job instance to get a Samples instance. See the samples attribute.
Please note that the list() method can use a lot of memory; for a large amount of samples it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
Usage:
retrieve all samples from a job:
>>> job.samples.iter() <generator object mpdecode at 0x10f5f3aa0>
retrieve samples with timestamp greater or equal to given timestamp:
>>> job.samples.list(startts=1484570043851) [[1484570043851, 554, 576, 1777, 821, 0], [1484570046673, 561, 583, 1782, 821, 0]]
- close(block=True)¶
Close writers one-by-one.
- flush()¶
Flush data from the writer threads.
- get(key, **params)¶
Get an element from the collection.
Parameters: key – element key. Returns: a dictionary with element data. Return type: dict
- iter(_key=None, count=None, **params)¶
Iterate over elements in the collection.
Parameters: count – limit amount of elements. Returns: a generator object over a list of element dictionaries. Return type: types.GeneratorType[dict]
- list(*args, **kwargs)¶
Convenient shortcut to list iter() results.
Please note that the list() method can use a lot of memory; for a large amount of elements it's recommended to iterate through them via the iter() method (all params and available filters are the same for both methods).
- stats()¶
Get resource stats.
Returns: a dictionary with stats data. Return type: dict
- write(item)¶
Write a new element to the collection.
Parameters: item – element data dict to write.
Spiders¶
- class scrapinghub.client.spiders.Spider(client, project_id, spider_id, spider)¶
Class representing a Spider object.
Not a public constructor: use a Spiders instance to get a Spider instance. See the Spiders.get() method.
Variables:
- project_id – a string project id.
- key – a string key in the format 'project_id/spider_id'.
- name – a spider name string.
- jobs – a collection of jobs, a Jobs object.
Usage:
>>> spider = project.spiders.get('spider1') >>> spider.key '123/1' >>> spider.name 'spider1'
- list_tags()¶
List spider tags.
Returns: a list of spider tags. Return type: list[str]
- update_tags(add=None, remove=None)¶
Update tags for the spider.
Parameters:
- add – (optional) a list of string tags to add.
- remove – (optional) a list of string tags to remove.
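A short sketch of tag housekeeping on a spider (the tag name and returned list are illustrative):
>>> spider = project.spiders.get('spider1')
>>> spider.update_tags(add=['production'])
>>> spider.list_tags()
['production']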
- class scrapinghub.client.spiders.Spiders(client, project_id)¶
Class to work with a collection of project spiders.
Not a public constructor: use a Project instance to get a Spiders instance. See the spiders attribute.
Variables: project_id – string project id.
Usage:
>>> project.spiders <scrapinghub.client.spiders.Spiders at 0x1049ca630>
- get(spider, **params)¶
Get a spider object for a given spider name.
The method gets/sets spider id (and checks if spider exists).
Parameters: spider – a string spider name. Returns: a spider object. Return type: scrapinghub.client.spiders.Spider
Usage:
>>> project.spiders.get('spider2') <scrapinghub.client.spiders.Spider at 0x106ee3748> >>> project.spiders.get('non-existing') NotFound: Spider non-existing doesn't exist.
- iter()¶
Iterate through the list of spiders for a project.
Returns: an iterator over spiders list where each spider is represented as a dict containing its metadata. Return type: collection.Iterable[dict]
Provided for the sake of API consistency.
- list()¶
Get a list of spiders for a project.
Returns: a list of dictionaries with spiders metadata. Return type: list[dict]
Usage:
>>> project.spiders.list() [{'id': 'spider1', 'tags': [], 'type': 'manual', 'version': '123'}, {'id': 'spider2', 'tags': [], 'type': 'manual', 'version': '123'}]