I'm trying to write a scraper in Node.js that pulls news headlines from many different sources (they are all structured differently, so my approach has to be as general as possible). I currently have working Python code that uses Beautiful Soup and regex to let me define a set of keywords and return the headlines that contain them. The relevant part of the Python code is shown below:
for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords))):
To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):
<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>
Given the keyword Time, the expected output would be a string containing the headline Time to end Clap for Carers.
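For reference, the Python pattern above could be rebuilt in JavaScript roughly like this. This is only a sketch: `escapeRegExp` and `buildKeywordRegex` are hypothetical helper names, and escaping the keywords is an addition that the Python line does not do.

```javascript
// Hypothetical helper: escape characters that are special in a RegExp,
// so that a character such as "." inside a keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the equivalent of Python's
// r'\b(?:%s)\b' % '|'.join(keywords)
function buildKeywordRegex(keywords) {
  return new RegExp('\\b(?:' + keywords.map(escapeRegExp).join('|') + ')\\b');
}

const headline = 'Time to end Clap for Carers, says founder';
const pattern = buildKeywordRegex(['Time']);

console.log(pattern.source);                               // \b(?:Time)\b
console.log(pattern.test(headline));                       // true
console.log(pattern.test('Timetable changes announced'));  // false (no word boundary after "Time")
```

The pattern is case-sensitive, like the Python one; passing `'i'` as a second argument to `RegExp` would relax that.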
My question is: is it possible to do something similar with cheerio? What would be the best way to achieve the same result in Node.js?
EDIT: This works for me now. In addition to matching headlines, I also wanted to extract the post URLs:
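function match_headlines($) {
  const keywords = ['lockdown', 'quarantine'];
  // Match text that starts with a capital letter and contains any keyword.
  // No "g" flag: with it, match() would return no `input` property.
  const regexPattern = new RegExp('\\b[A-Z].*?(' + keywords.join('|') + ').*\\b');
  const matches = $('a').map((i, a) => {
    const link = $(a).attr('href');
    const match = $(a).text().match(regexPattern);
    if (match !== null) {
      return {
        headline: match.input, // full text of the anchor
        post_url: link
      };
    }
    // Returning nothing makes cheerio's map() skip this anchor.
  });
  // get() converts the cheerio collection into a plain array.
  return matches.get();
}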
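One thing to watch when reading `match['input']` as above is the set of flags on the pattern. The sketch below (the sample string is made up) shows how the result of `String.prototype.match` differs with and without the `g` flag:

```javascript
const text = 'Lockdown eased as quarantine rules change';

// Without the "g" flag, match() returns a single match object that
// carries the `index` and `input` properties.
const single = text.match(/lockdown|quarantine/i);
console.log(single[0]);              // Lockdown
console.log(single.input === text);  // true

// With the "g" flag, match() returns a plain list of all matched
// substrings, and there is no `input` property on it.
const all = text.match(/lockdown|quarantine/gi);
console.log(all);        // [ 'Lockdown', 'quarantine' ]
console.log(all.input);  // undefined
```

So if the full anchor text is what is wanted as the headline, either leave out the `g` flag or read the text directly from `$(a).text()`.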