How to extract specific tags from multiple HTML .txt files using Python

Question

from bs4 import BeautifulSoup

addresses = []
with open("/rawhtml/greerwilsonchapel.com_executives_contact_us.txt") as fp:
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(fp, "html.parser")
    #thumb = soup.find('div',class_="widget widget_text")
    address = soup.find('div', class_="locator-titles").get_text().rstrip('\n').split('\n')
    #address = add.find_All('p').get_text()
    addresses.append(address)
print(addresses)

pooja · Answer 1 · Aug 5, 2020

Hello, @Pooja,

I ran into the same issue, and the code below helped me. I hope it helps you too.

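from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http:Abc.stm"
# Python 3: urlopen lives in urllib.request (urllib.urlopen was Python 2 only)
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")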

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

answered Aug 5, 2020 by Kedaar Thomas

But how can I fetch only the address from each file? Every file contains an address, but in a different tag, class name, or id.

commented Aug 5, 2020 by pooja
• 120 points

Hello @pooja,

You can extract it using CSS selectors.

With CSS selectors, we do not need to know in advance what the content itself looks like (as we would with regular expressions, where we specify the pattern of the data). Since an HTML document is structured as a tree of nodes, CSS selectors use that structure to navigate the nodes and select the data we want. We only need to know which nodes in the HTML file contain what we want to extract.
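As a rough sketch of the idea, the snippet below tries a short list of candidate selectors and returns the text of the first one that matches. The sample markup, the selector list, and the helper name find_address are all made up for illustration; only the locator-titles class comes from the question. It assumes Beautiful Soup 4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one saved page.
html = """
<div class="widget">
  <div class="locator-titles">
    <p>123 Example Street</p>
    <p>Springfield, IL 62701</p>
  </div>
</div>
"""

# Candidate selectors are assumptions; extend the list as new layouts turn up.
CANDIDATE_SELECTORS = [
    "address",
    "div.locator-titles",
    '[class*="address"]',
    "#contact",
]


def find_address(markup):
    """Return the text lines under the first matching selector, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    for selector in CANDIDATE_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            text = node.get_text("\n")
            # Keep only non-empty lines, stripped of surrounding whitespace
            return [line.strip() for line in text.splitlines() if line.strip()]
    return None


address_lines = find_address(html)
print(address_lines)
```

This only works when a page matches one of the listed selectors, so it reduces the hardcoding rather than removing it.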

You can refer to this to see how it works.

Hope this is helpful to you!
Thank you!

commented Aug 5, 2020 by Niroj
• 82,840 points

Yes, I am doing the same thing, but every file uses different CSS selector names. In that case, how can I fetch the address of every company by selector name?

commented Aug 7, 2020 by pooja
• 120 points

Hello @pooja,

You can refer to this for your related queries.

Hope this helps you!

commented Aug 7, 2020 by Niroj
• 82,840 points

I went through this, but it did not give me a solution.

I want something that can fetch only the contact-us details from many different files. Are there any NLP libraries that provide this?

commented Aug 7, 2020 by pooja
• 120 points

Hello @pooja,

Have you installed Beautiful Soup? I'd recommend learning it.
It is a Python library that lets you extract tags and/or the text inside them.

There is also the requests_html library, which some people find better than Beautiful Soup.
There is also urllib3, which is designed for making web requests.
I'd recommend reading about them and choosing what suits you best.

commented Aug 7, 2020 by Niroj
• 82,840 points

Hello @Niroj,

I have read about BeautifulSoup, and it is helpful for extracting tags from HTML. But what we want is to extract addresses from many different files, and every file has different class and id names around the address. It is not possible to hardcode 1000s of class names.

commented Aug 10, 2020 by pooja
• 120 points

Hello @pooja,

Try Selenium's XPath support (an XPath expression is a useful way to locate an element. We use XPath when the element has no proper id or name attribute to select it by.)
If that does not solve your problem, I guess you will have to do it manually.
Or you can write a script to get the raw data and find a pattern for addresses to extract them.
E.g. if they are Gmail addresses, extract them using something.endswith("gmail.com")
Or use regular expressions.
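To make the pattern idea concrete, here is a minimal sketch using only the standard re module. It assumes US-style postal addresses that sit on one line and end in a two-letter state code and a ZIP code. The sample text, the pattern, and the helper name find_addresses are illustrative assumptions, and real pages will likely need a looser or different pattern.

```python
import re

# Hypothetical page text, e.g. what soup.get_text() might return.
text = """Contact Us
Example Office
123 Example Street, Springfield, IL 62701
Phone: (555) 010-0000
"""

# Assumed shape: street number, rest of the line up to a comma,
# two-letter state code, 5-digit ZIP with an optional +4 suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,6}[ \t]+[^\n]*?,[ \t]*[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?\b"
)


def find_addresses(page_text):
    """Return every substring of page_text that matches ADDRESS_RE."""
    return [match.group(0) for match in ADDRESS_RE.finditer(page_text)]


addresses = find_addresses(text)
print(addresses)
```

Because the pattern looks at the text rather than the markup, the same function can be run over every file regardless of which tag, class, or id holds the address.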

commented Aug 10, 2020 by Niroj
• 82,840 points

Thanks @Niroj,

I am using regular expressions only, because I think this is the only way to get results. If you have any more solutions, please tell me.