Extract content from web pages, including link URLs, image URLs and entire web page contents.
Options: android, chrome, googlebot, ie, ios, opera, safari. Default: chrome.
Default: en.
[...] Extract Link URLs and Extract Image URLs commands using a regular expression. Default: https?://.+
[...] Extract Contents command. An example is included in the package as default-config.json. Default: .extract-web/default-config.json
[...] Extract Contents command. Options: json, yaml. Default: json.
[...] Extract Contents command when it is in JSON mode. Default: 2.
[...] Extract Contents command when it is in YAML mode. Default: 2.
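As a rough illustration of the regular-expression setting for the Extract Link URLs and Extract Image URLs commands, the Python sketch below filters a list of URLs with the pattern `https?://.+`. It assumes the pattern acts as a filter on extracted URLs and must match the whole URL; the package's actual matching rules may differ. `filter_urls` and the sample URLs are hypothetical and not part of the package.

```python
import re

def filter_urls(urls, pattern=r"https?://.+"):
    """Keep only URLs that the pattern matches in full (assumed behavior)."""
    regex = re.compile(pattern)
    return [u for u in urls if regex.fullmatch(u)]

# Hypothetical sample input: a mix of absolute, relative, and non-HTTP URLs.
urls = [
    "https://atom.io/packages/example",
    "mailto:someone@example.com",
    "http://example.com/image.png",
    "/relative/path",
]
kept = filter_urls(urls)
print(kept)
```

Under this assumption, only the absolute `http` and `https` URLs survive; a stricter pattern such as `https://.+` would also drop the plain `http` entry.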
The Extract Contents command outputs a JSON or YAML document containing an array of objects. Each extracted web page is represented by a JSON/YAML object in this array.
The properties object for each extracted web page contains an array of properties extracted from the web page.
To customize the properties extracted from each item, prepare a configuration file similar to the example below. Properties to extract are specified using CSS selector syntax.
Example:
{
  "target": [
    {
      "pattern": {
        "url": "https://atom.io/packages/.*"
      },
      "properties": {
        "title": {
          "text": "title"
        },
        "body": {
          "text": "body"
        },
        "bodyAsHtml": {
          "html": "body"
        },
        "package_meta": {
          "text": ".package-meta ul li a",
          "isArray": true
        },
        "meta_description": {
          "attr": "meta[name=description]",
          "args": ["content"]
        },
        "domain": {
          "default": "atom.io"
        }
      }
    }
  ]
}
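To make the structure of the example configuration easier to follow, here is a minimal Python sketch that picks the target whose `pattern.url` matches a page URL and summarizes each property rule. It is not the package's implementation. It assumes the URL pattern is a regular expression that must match the whole URL, and that each property carries exactly one of the keys `text`, `html`, `attr`, or `default`, as in the example. `find_target` and `describe` are hypothetical helper names.

```python
import json
import re

# Trimmed copy of the example configuration above.
CONFIG = json.loads("""
{"target": [{"pattern": {"url": "https://atom.io/packages/.*"},
             "properties": {
               "title": {"text": "title"},
               "package_meta": {"text": ".package-meta ul li a", "isArray": true},
               "meta_description": {"attr": "meta[name=description]", "args": ["content"]},
               "domain": {"default": "atom.io"}}}]}
""")

def find_target(config, url):
    """Return the first target whose pattern.url matches the URL, else None.

    Assumption: the pattern is a regular expression matched against the whole URL.
    """
    for target in config["target"]:
        if re.fullmatch(target["pattern"]["url"], url):
            return target
    return None

def describe(target):
    """Map each property name to (rule kind, selector or fixed value)."""
    kinds = ("text", "html", "attr", "default")
    summary = {}
    for name, rule in target["properties"].items():
        kind = next(k for k in kinds if k in rule)
        summary[name] = (kind, rule[kind])
    return summary

# Hypothetical page URL that falls under the configured pattern.
target = find_target(CONFIG, "https://atom.io/packages/example")
summary = describe(target)
for name, (kind, value) in summary.items():
    print(f"{name}: {kind} -> {value}")
```

Reading the summary this way shows the apparent roles in the example: `text` and `html` take a CSS selector, `attr` takes a selector plus the attribute name in `args`, and `default` supplies a fixed value.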