python - How to deal with problematic encoding while webscraping?

- January 15, 2012

I am trying to scrape and merge the contents of multiple tables, each on a separate webpage. I have read a lot about encoding and Unicode, including the linked material, but I can't figure out whether I've missed something or whether there is a problem with the encoding on the webpage. In the first link, you can see that for the date 10/31/2014 the brand name column reads "pearâ€™s gourmet", and a lot of other strings come out with funny apostrophes, such as "children’s medical ventures, llc" (instead of "children's..."). I can see the funny apostrophes in IPython, and they come out in the CSV file as â€™.

My questions are:

Am I doing something wrong with the encoding, so that the apostrophes come out wrong?
If not, how do I replace the wrong characters with an apostrophe?
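
On the second question, one common pattern can be sketched. The sequence â€™ is what the UTF-8 bytes of the right single quotation mark (U+2019) look like when they are decoded as Windows-1252, so where a string really does contain â€™, that decoding step can be reversed, and U+2019 can then be swapped for a plain apostrophe if that is wanted. This is only a sketch under stated assumptions: it targets Python 3 (the post uses Python 2.7), it assumes the wrong codec was Windows-1252, and the helper names demojibake and straighten_apostrophes are made up for illustration and do not come from any library.

```python
def demojibake(text):
    """Reverse UTF-8 text that was mistakenly decoded as Windows-1252.

    Assumption: the damage came from exactly one wrong cp1252 decode.
    Strings that do not fit that pattern are returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def straighten_apostrophes(text):
    """Replace curly single quotes (U+2018, U+2019) with a plain apostrophe."""
    return text.replace("\u2018", "'").replace("\u2019", "'")


# "pearâ€™s gourmet" written with escapes so the source file stays ASCII.
garbled = "pear\u00e2\u20ac\u2122s gourmet"

fixed = demojibake(garbled)            # curly apostrophe restored
plain = straighten_apostrophes(fixed)  # curly apostrophe made plain

print(repr(fixed))
print(repr(plain))
```

A string that already holds a correct U+2019, such as "children’s medical ventures, llc", passes through demojibake unchanged, because its cp1252 bytes are not valid UTF-8 and the function falls back to the input.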

I have tried to make reproducible code below.

#Import libraries
import sys
#import IPython
print(sys.version_info[0:30])       #Python 2.7.11
#print(IPython.version_info)       #IPython 4.0.1
import pandas as pd
from bs4 import BeautifulSoup
#from lxml import html
import requests
import os
cwd = os.getcwd()

#Generate dataframe and lists
df = pd.DataFrame()
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]

#Scrape number of separate webpages that contain tables for a given year
pstr1 = "http://www.fda.gov/safety/recalls/archiverecalls/"
#for i in range(2006,2017):
for i in range(2014,2015):
    a = ["/default.htm","/default.htm?page="]
    pagename = pstr1 + str(i) + a[0]
    print pagename
    r = requests.get(pagename)
    r.raise_for_status()
    #print(page.encoding)
    r.encoding = 'utf-8'
    page = BeautifulSoup(r.text)
    npages = page.select('.pagination-clean a')

    #Scrape data from each table and combine into dataframe
    for j in range(len(npages)):
        pagename = pstr1 + str(i) + a[1] + str(j+1)
        print pagename
        r = requests.get(pagename)
        r.encoding = 'utf-8'
        soup = BeautifulSoup(r.text)
        t1=soup.find('table')

        for row in t1.findAll("tr"):
            cells = row.findAll('td')

            if len(cells)!=0: #ignore heading

                A.append(cells[0].find(string=True))
                B.append(cells[1].find(string=True))
                C.append(cells[2].find(string=True))
                D.append(cells[3].find(string=True))
                E.append(cells[4].find(string=True))
                F.append(cells[5].find(string=True))

                #Examine problematic characters
                try:
                    cells[1].find(string=True).decode('utf-8')
                    #print "string is UTF-8, length %d bytes" % len(cells[1].find(string=True))
                except UnicodeError:
                    print "string is not UTF-8"
                    #print(cells[1].find(string=True))

df=pd.DataFrame(A, columns=['date'])
df['brand_name']=B
df['product_description']=C
df['reason_problem']=D
df['company']=E
df['details_photo']=F
df.to_csv(cwd+'/table1.csv', encoding='utf-8')
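
One possible reading of the symptom, which cannot be confirmed from the post alone, is that the scraped strings are fine and only the CSV viewer is wrong: a program that opens a UTF-8 file as Windows-1252 will display a correct ’ as â€™. The sketch below, written for Python 3 with only the standard library, reproduces that effect and shows that the 'utf-8-sig' codec prepends a byte order mark, which some spreadsheet programs use to detect UTF-8. The helper name rows_to_csv_bytes and the sample row are illustrative assumptions.

```python
import csv
import io

# Sample data containing a correct right single quotation mark (U+2019).
rows = [["date", "brand_name"], ["10/31/2014", "pear\u2019s gourmet"]]


def rows_to_csv_bytes(rows, encoding):
    """Serialize rows as CSV text, then encode with the given codec."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


plain_utf8 = rows_to_csv_bytes(rows, "utf-8")
with_bom = rows_to_csv_bytes(rows, "utf-8-sig")

# What a viewer shows if it wrongly assumes Windows-1252 for the plain file.
misread = plain_utf8.decode("cp1252")

print(repr(misread))
print(with_bom[:3])
```

If that reading applies, passing encoding='utf-8-sig' to to_csv instead of 'utf-8' may be enough for the file to display correctly; the stored text itself would be the same either way.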
