I'm trying to build a scraper in Node.js that extracts news headlines from a large number of websites (they are all different, so my approach has to be as general as possible). I currently have working Python code that uses Beautiful Soup and regex: I declare a set of keywords and it returns the headlines that contain them. A relevant sample of the Python code is below:
for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords))):
    print(items)  # each item is a text node containing one of the keywords
To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):
<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>
Given the keyword Time, the expected output would be a string with the headline: Time to end Clap for Carers, says founder
I'm wondering whether something similar is possible with cheerio. What would be the best strategy in Node.js to get similar results?
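For reference, here is a minimal sketch of how the Python keyword pattern might translate to plain JavaScript, independent of any HTML parser. The helper names (escapeRegExp, buildKeywordRegex, filterHeadlines) are made up for illustration, and the input is assumed to be an array of text strings already pulled out of the page.

```javascript
// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build /\b(?:kw1|kw2)\b/, mirroring r'\b(?:%s)\b' % '|'.join(keywords).
// No "g" flag, so test() keeps no state between calls.
function buildKeywordRegex(keywords) {
  return new RegExp('\\b(?:' + keywords.map(escapeRegExp).join('|') + ')\\b');
}

// Keep only the strings that contain at least one keyword as a whole word.
function filterHeadlines(texts, keywords) {
  const pattern = buildKeywordRegex(keywords);
  return texts.filter((t) => pattern.test(t));
}

const sample = [
  'Time to end Clap for Carers, says founder',
  'Sometimes it rains',
];
console.log(filterHeadlines(sample, ['Time']));
// prints [ 'Time to end Clap for Carers, says founder' ]
```

The match is case-sensitive, as in the Python version; passing 'i' as the second argument to RegExp would make it case-insensitive.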
EDIT: This is now working for me. In addition to the matching headlines, I also wanted to get the URLs of the posts.
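function match_headlines($) {
  const keywords = ['lockdown', 'quarantine'];
  // Match text with a capital letter at a word boundary, later followed by a keyword.
  // No "g" flag: with it, match() returns a plain array without the `input` property.
  const regexPattern = new RegExp('\\b[A-Z].*?(' + keywords.join('|') + ').*\\b');
  let matches = $('a').map((i, a) => {
    let link = $(a).attr('href');
    let match = $(a).text().match(regexPattern);
    if (match !== null) {
      return {
        headline: match.input, // full text of the anchor
        post_url: link
      };
    }
    return null; // cheerio's map() drops null and undefined results
  });
  // get() converts the cheerio wrapper into a plain array of objects.
  return matches.get();
}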
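Since hrefs like /news/uk-52773032 are relative, the collected post_url values may need resolving against the page address. This is a rough sketch using Node's built-in URL class. The function name absolutizePosts and the pageUrl parameter (the address the HTML was fetched from) are assumptions for illustration, and the input is assumed to have the { headline, post_url } shape returned above.

```javascript
// Resolve each post_url against the page it was scraped from.
// `pageUrl` is a hypothetical parameter: the address the HTML came from.
function absolutizePosts(posts, pageUrl) {
  const result = [];
  for (const post of posts) {
    // Anchors without an href yield undefined; skip them.
    if (typeof post.post_url !== 'string') continue;
    try {
      result.push({
        headline: post.headline.trim(),
        post_url: new URL(post.post_url, pageUrl).href,
      });
    } catch (err) {
      // new URL() throws a TypeError on hrefs it cannot parse; skip those.
    }
  }
  return result;
}

const posts = [
  { headline: ' Time to end Clap for Carers, says founder ', post_url: '/news/uk-52773032' },
];
console.log(absolutizePosts(posts, 'https://www.example.com/news'));
```

Absolute hrefs pass through unchanged, so the same call works for sites that mix relative and absolute links.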