Throw new error cheerio load expects a string - Fixing errors and finding optimal solutions to problems

* Use parse5 as a default parser (closes #863)

* Use documents via $.load

* Add test for #997

* Change options format

* Update unit test

Update test phrasing according to recent changes in parsing logic.

* 1.0.0-rc.1

* Improve release process

Limit responsibility of "pre-publish" script to simply validate the
project's `History.md` file (by verifying an entry for the current
release). Define a separate script for history generation. Separating
the workflow in this way allows for manual modification of the release
notes.

* Correct errors in Readme.md

* Document advanced usage with htmlparser2

* Update History.md (and include migration guide)

* Remove documentation for `xmlMode` option

Simply expose an option named `xml` that enables XML parsing via
htmlparser2 with the ability to specify additional options for that
parser.

* Rename `useHtmlParser2` option

This flag is used to control parsing behavior internally, but it is not
intended for use by consumers. Prefix the name with an underscore in
order to discourage users from relying on it.

* Re-write migration guide for version 1.0 (#1078)

Incorporate recent feedback from consumers who have experimented with
the version 1.0 release candidate.

* Pass locationInfo option to parse5 (#1155)

* Update css-select to the latest version 🚀 (#1158)

Breaking change: Invalid selectors now throw Errors, not SyntaxErrors.

* Use eslint & prettier, add precommit hook (#1152)

* chore(package): update mocha to version 5.0.4 (#1088)

* Ensure text nodes expose the DOM level 1 API

Since enabling the `withDomLvl1` parsing option, nodes cannot be created
with an object literal. Create new text nodes using the `evaluate`
function to ensure they expose the correct attributes.

* fixing missing prop(‘outerHTML’) implementation. Added an ‘outerHTML’ case to the switch in the prop function, which wraps a clone of  in a container element, and sets  to that container's HTML (#945)

* Do not lint files excluded from version control (#1162)

This includes code coverage reports as generated by the command `make
test-cov`.

* Correct typo in git hook configuration (#1163)

* Correct typo in git hook configuration

* Reformat package manifest to satisfy linter

* Fix .text with a function as the argument

* Fix `.text` being called on a collection with size > 1 with a function

* chore(package): update coveralls to version 3.0.0 (#1086)

* Update jsdom to the latest version 🚀 (#1008)

* Throw a useful error on invalid input to cheerio.load() (#1087)

* Procedurally generate API documentation from source (#1165)

* Use parse5 to serialize the DOM, use lodash to clone dom

* Fix DoS/RCE vulnerability in lodash@4.15.0 (#1179)

fixes #1175

* Add eslint-plugin-jsdoc, improve documentation (#1168)

* Improve variable names (#1183)

Promote consistency in variable names within the project's source and
unit tests. This helps to highlight the distinction between the object
exported by the module and the function produced by the `load` method.
The latter value is expected to mimic the jQuery API, while the former
value generally should only be used for a small set of methods specific
to Cheerio:

- `load`
- `html`
- `xml`
- `text`

Other usages of the exported object are discouraged, and a future patch
will update the unit tests to reflect the usages that are endorsed for
long-term stability.

* Formally test deprecated APIs (#1184)

Methods named `load`, `html`, `xml`, and `text` are defined in many
locations:

Today, Cheerio defines multiple versions of methods named `load`,
`html`, `xml`, `text`, and `parseHTML`. These alternate versions may be
defined in up to three distinct parts of the API:

- exported by the Cheerio module
- as static methods of the "loaded" Cheerio factory function
- as instance methods of the "loaded" Cheerio factory function

Some of these are superfluous, and because some unnecessarily conflict
with idiomatic jQuery coding patterns, they have been designated for
future removal [1].

Add tests for the deprecated methods in order to avoid regressions prior
to their removal. Insert comments to delineate the methods which are
endorsed and which have been deprecated. For the latter group of
methods, include recommendation for the preferred alternatives.

[1] #1122

* Implement for...of iterator via Symbol.iterator (#1197)

* Implement for...of iterator via Symbol.iterator

Similar to jQuery: https://github.com/jquery/jquery/blob/1ea092a54b00aa4d902f4e22ada3854d195d4a18/src/core.js#L371-L373

Fixes #1191

* Assert that the iterator ends

#1197 (comment)
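
The for...of support described in the last entry can be illustrated without Cheerio itself. The following is a minimal sketch, assuming a hypothetical array-like constructor named Collection (it is not Cheerio's source). It borrows the built-in array iterator, which is the approach taken by the jQuery code linked above.

```javascript
// Hypothetical array-like collection; it stands in for a Cheerio/jQuery object.
function Collection(items) {
  for (var i = 0; i < items.length; i++) {
    this[i] = items[i];
  }
  this.length = items.length;
}

// Reuse the built-in array iterator. It is generic: it only needs
// numeric indices and a length property on the object it walks.
Collection.prototype[Symbol.iterator] = Array.prototype[Symbol.iterator];

var seen = [];
for (var item of new Collection(['a', 'b', 'c'])) {
  seen.push(item);
}
console.log(seen); // [ 'a', 'b', 'c' ]
```

Because the iterator is driven by length, it reports done once the last index has been visited, which is the property the "Assert that the iterator ends" commit checks.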

September 8, 2015, 12:54 PM

Web scraping is a technique for extracting information from websites.

This tutorial shows how to scrape the front page of Hacker News to collect all of the top links together with their metadata: for example, the title, the URL, and the number of comments. The page data is extracted with node.js and the cheerio module.

Cheerio is a lightweight, fast, and flexible module that provides a jQuery-style API.

The tutorial also uses the request module, a simplified HTTP client.

Requirements

To follow this tutorial you should be comfortable with node.js and jQuery and be able to carry out basic Linux tasks (for example, working over SSH).

node.js must also be installed beforehand.

Note: Useful instructions for installing and using node.js are available in a dedicated section of our site.

Installing the cheerio and request modules

To install the required modules with npm, run:

npm install request cheerio

This command installs the modules only into the current working directory.

To install the modules globally, run:

npm install -g request cheerio

Create a file named scrape.js and paste in the following code:

var request = require('request');
var cheerio = require('cheerio');

These two lines load the module dependencies.

Scraping the page

Now load the Hacker News front page with a simple request and print its HTML, provided that no error occurred and the status code is 200.

Add the following lines to the file:

request('https://news.ycombinator.com', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    console.log(html);
  }
});

Run the script with node scrape.js. The page's HTML is printed in the terminal window.

To work out how to extract the metadata you need, you have to know how the elements are structured in the HTML. The most convenient approach is to use the developer tools built into Google Chrome to inspect the target element on the page: right-click it and choose Inspect element.

Study the structure of the elements in the markup.

In this example, to quickly extract the data for the page's top 30 links, it is enough to select every "span" element with the class "comhead" and the "a" element that precedes each one; the jQuery API function prev() does this.

Change the request code as follows:

request('https://news.ycombinator.com', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    $('span.comhead').each(function (i, element) {
      var a = $(this).prev();
      console.log(a.text());
    });
  }
});

Running this code prints a list of 30 titles. Edit the request code once more to collect the rest of the metadata:

request('https://news.ycombinator.com', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    $('span.comhead').each(function (i, element) {
      var a = $(this).prev();
      var rank = a.parent().parent().text();
      var title = a.text();
      var url = a.attr('href');
      var subtext = a.parent().parent().next().children('.subtext').children();
      var points = $(subtext).eq(0).text();
      var username = $(subtext).eq(1).text();
      var comments = $(subtext).eq(2).text();
      // Our parsed metadata object
      var metadata = {
        rank: parseInt(rank, 10),
        title: title,
        url: url,
        points: parseInt(points, 10),
        username: username,
        comments: parseInt(comments, 10)
      };
      console.log(metadata);
    });
  }
});

This code does the following:

Selects the previous element, the "a" element that precedes the span:

var a = $(this).prev();

Gets the rank by reading the text of the element two levels above the "a" element:

var rank = a.parent().parent().text();

Reads the link title:

var title = a.text();

Reads the href attribute of the "a" element:

var url = a.attr('href');

Gets the subtext from the next row of the HTML table:

var subtext = a.parent().parent().next().children('.subtext').children();

Extracts the corresponding data:

var points = $(subtext).eq(0).text();
var username = $(subtext).eq(1).text();
var comments = $(subtext).eq(2).text();
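
The metadata object relies on parseInt reading a leading number and ignoring the text that follows it. Here is a small sketch of that behavior. The sample strings are assumptions about what the table cells might contain, not captured output, and toCount is a hypothetical helper name.

```javascript
// Sample strings are illustrative assumptions, not real scraped output.
var samples = ['1.', '123 points', '45 comments', 'discuss'];

// parseInt reads leading digits and stops at the first character that
// is not part of a number. The radix 10 makes the decimal base explicit.
var parsed = samples.map(function (text) {
  return parseInt(text, 10);
});

console.log(parsed); // [ 1, 123, 45, NaN ]

// Text with no leading number yields NaN. This hypothetical helper
// falls back to 0 when a numeric count is required.
function toCount(text) {
  var n = parseInt(text, 10);
  return Number.isNaN(n) ? 0 : n;
}

console.log(toCount('discuss')); // 0
```

Whether 0 is the right fallback depends on how the stored data is used later; keeping NaN or null may be preferable when "no value" must be told apart from a real zero.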

The extracted data can be stored in MongoDB or Redis for further processing.

The complete scrape.js file consists of the two require lines followed by the final version of the request call shown above.

Tags: cheerio, Node.js, request

Best JavaScript code snippets using cheerio.load (showing top 4 results out of 315)

var parseWebsite = async function parseWebsite(website) {
 try {
  var url = normalizeUrl(website);
  var html = await fetch(url).then(function (res) {
   return res.text();
  });
  var $ = load(html);
  var handles = {};
  module.exports.CONFIG.socialNetworks.forEach(function (socialNetwork) {
   handles[socialNetwork] = parse(socialNetwork)($);
  });
  return handles;
 } catch (error) {
  throw new Error('Error fetching website data for ' + website + ': ' + error.message);
 }
}

const rog = async (url: string, parsers: Record<string, RogPlugin>): Promise<RogResponse> => {
 if (!isURL(url)) {
  throw new Error(`URL is invalid: ${url}`);
 }

 if (isBinaryPath(url)) {
  throw new Error(`Binary is not supported: ${url}`);
 }

 const response = await got(url, {
  encoding: undefined,
  timeout: 2000,
  responseType: 'buffer'
 });
 const body = getBody(response);

 if (!isHTML(body)) {
  throw new Error('Response is not HTML');
 }

 const $: CheerioStatic = load(body);
 const data: RogResponse = {};
 for (const [key, parse] of Object.entries(parsers)) {
  data[key] = parse($, url) ?? '';
 }

 return data;
}
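
The error named in the page title is raised when load receives something that is not usable markup; in the 1.0 release candidates this appears to happen for null or undefined input, for instance when a request failed and its body was never set. A minimal sketch of a guard that normalizes a response body before it reaches load follows. The helper name toHtmlString is hypothetical and is not part of cheerio.

```javascript
// Hypothetical helper (not part of cheerio): normalize a response body
// to a string before handing it to cheerio.load().
function toHtmlString(body) {
  if (typeof body === 'string') {
    return body;
  }
  if (Buffer.isBuffer(body)) {
    // Assumes the bytes are UTF-8; other encodings need explicit decoding.
    return body.toString('utf8');
  }
  throw new TypeError(
    'Expected an HTML string or Buffer, got ' +
      (body === null ? 'null' : typeof body)
  );
}

console.log(toHtmlString(Buffer.from('<p>hi</p>'))); // <p>hi</p>
```

Failing early with the actual type in the message makes it easier to trace the problem back to the request that produced the empty body.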

const parseWebsite = async website => {
 try {
  const url = normalizeUrl(website)
  const html = await fetch(url).then(res => res.text())
  const $ = load(html)
  let handles = {}
  module.exports.CONFIG.socialNetworks.forEach(socialNetwork => {
   handles[socialNetwork] = parse(socialNetwork)($)
  })
  return handles
 } catch (error) {
  throw new Error(
   `Error fetching website data for ${website}: ${error.message}`,
  )
 }
}

