Python 3.5: Web scraping and stripping HTML tags
I am scraping web content and am stuck on one problem. After a series of processing steps to narrow the result down to the scope I want, I cannot strip the HTML tags from the items in the list to get plain text. I have tried replace, re.compile and join (to turn the list into text before stripping). They either do not work, because they are designed for strings, or they raise errors when run.
Could anyone give me a hint on how to do that? For example, I want the output of the following code to change from
<p class="course-d-title">instructor</p>
to
instructor
import tkinter as tk
import re

def test():
    from bs4 import BeautifulSoup
    import urllib.request
    from urllib.parse import urljoin

    '''for layer 0'''
    url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01'
    resp = urllib.request.urlopen(url_text)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    a = soup.find_all('p')

    # Drop every <p> before the one that contains 'instructor'
    k = 0
    for item in a[:]:
        if 'instructor' in item:
            a = a[k:]
            break
        k += 1

    # Cut the list off shortly before the <p> that contains 'enquiries'
    j = 0
    for item in a[:]:
        if 'enquiries' in item:
            a = a[:j-1]
            break
        j += 1

    # Prints whole Tag objects, so the HTML markup is still included
    for i in range(0, a.__len__()):
        print(a[i])

if __name__ == '__main__':
    test()
Use .text to extract the text from a bs4 element:
>>> a = soup.find_all('p')
>>> data = [item for item in a if 'instructor' in item]
>>> data
[<p class="course-d-title">instructor</p>]
>>> data[0].text
'instructor'
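The same idea can be applied to the whole slice at once instead of a single element. The following is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the course page (the real markup may differ), texts_between is a hypothetical helper name, and the built-in 'html.parser' is assumed as the parser. It uses get_text(strip=True), which returns the same text as .text with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the course page; the real markup may differ.
SAMPLE_HTML = """
<p class="course-d-title">Overview</p>
<p class="course-d-title">instructor</p>
<p>Dr. Example</p>
<p class="course-d-title">enquiries</p>
<p>1234 5678</p>
"""


def texts_between(soup, start, stop):
    """Return the text of each <p> from `start` up to, but not including, `stop`.

    Raises ValueError if either marker text is missing.
    """
    # Convert every Tag to plain text first, then slice the list of strings.
    texts = [p.get_text(strip=True) for p in soup.find_all('p')]
    begin = texts.index(start)
    end = texts.index(stop, begin)
    return texts[begin:end]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
lines = texts_between(soup, 'instructor', 'enquiries')
print(lines)
```

Converting to text before slicing means the markers are compared against plain strings, so the result is already a list of str and can be joined or printed without any further tag stripping.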