Boldt: Python Scraping, how to get text from this through BeautifulSoup?

Thursday, 8 August 2013

Python Scraping, how to get text from this through BeautifulSoup?

Here is my code to scrape text content from a site. It works, but I am
not getting plain text only. How do I handle that?

from bs4 import BeautifulSoup
import mechanize

def getArticle(url):
    # Fetch the page and parse it
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    soup = BeautifulSoup(htmltext)
    # Print the child nodes of each article-body span (Python 2 print)
    for tag in soup.findAll('span', {'itemprop': 'articleBody'}):
        print tag.contents

For example, when I scrape a site, I get this output:
"[u"\nIn Soviet Russia, it's the banks that pay customers' bills.\xa0Or,
at least, one might.",
, u'\n',
, u'\r\nAn interesting case has surfaced in Voronezh, Russia, where a man
is suing a bank for more than 24 million Russian rubles (about $727,000)
in compensation over a handcrafted document that was signed and recognized
by the bank.\xa0',
, u'\n',
, u'\r\nA person who goes by name Dmitry Alexeev (his surname was changed
', by the first Russian outlet to publish this story, u') said that in
2008 he received a letter from ', Tinkoff Credit Systems, u'\xa0in his
mailbox. It was a credit card application form with an agreement contract
enclosed, much like the applications Americans receive daily from various
banks working with ', Visa
How do I get plain text only?
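
One possible approach, sketched under some assumptions: tag.contents returns a list of child nodes (strings mixed with nested tags), which is why the output above is a list of u'...' pieces. Beautiful Soup's get_text() method returns only the text of a tag and its descendants as a single string. The helper name extract_article_text and the sample HTML below are made up for illustration, and the sketch uses Python 3 syntax and the built-in "html.parser" rather than the Python 2 setup in the question.

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the plain text of every span marked itemprop="articleBody"."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for tag in soup.find_all("span", {"itemprop": "articleBody"}):
        # get_text() joins the strings of all descendants, so nested
        # tags such as <a> or <br> do not appear in the result.
        text = tag.get_text()
        # split() with no arguments also splits on non-breaking spaces
        # (\xa0), so this collapses every run of whitespace to one space.
        parts.append(" ".join(text.split()))
    return "\n".join(parts)


# Hypothetical sample markup, loosely modelled on the output in the question.
sample = (
    '<span itemprop="articleBody">\nIn Soviet Russia, it\'s the banks '
    'that pay customers\' bills.\xa0Or, at least, one '
    '<a href="#">might</a>.<br/>\n</span>'
)
print(extract_article_text(sample))
```

In the original function, the equivalent change would be to print tag.get_text() instead of tag.contents. Whether to collapse whitespace afterwards is a design choice; drop that step if line breaks in the article matter.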
