python - How to deal with problematic encoding while webscraping?


I am trying to scrape and merge the contents of multiple tables, each on a separate webpage. I have read a lot about encoding and Unicode, including several links, but I can't figure out whether I've missed something or whether there is a problem with the encoding on the webpage. In the first link, you can see that for the date 10/31/2014 the Brand Name column reads "Pear’s Gourmet", and a lot of other strings come out with funny apostrophes, such as "Children’s Medical Ventures, LLC" (instead of "Children's..."). I can see the funny apostrophes in IPython, and they come out in the CSV file as ’.
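One way to check what the "funny apostrophe" actually is: a minimal sketch, using only the standard library, that prints the code point and Unicode name of the curly and straight apostrophes. The helper names `describe` and `misread_as_cp1252` are made up for illustration. The second helper shows what the same character looks like if its UTF-8 bytes are later read as Windows-1252, which is only an assumption about how a CSV viewer might behave, not a diagnosis of this page.

```python
# -*- coding: utf-8 -*-
import unicodedata


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as Windows-1252 by mistake."""
    return text.encode("utf-8").decode("cp1252")


curly = u"\u2019"   # the apostrophe that appears in the scraped strings
straight = u"'"     # the plain ASCII apostrophe

print(describe(curly))
print(describe(straight))
# Shows the three-character sequence a cp1252 reader would display instead
print(repr(misread_as_cp1252(u"Pear\u2019s Gourmet")))
```

If the first line reports a right single quotation mark, the text was most likely decoded correctly and the character is simply a typographic apostrophe rather than a decoding error.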

My questions are:

  1. Am I doing something wrong with the encoding, so that the apostrophes come out wrong?
  2. If not, how do I replace the wrong characters with an apostrophe?
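For the second question, a possible approach is sketched here under the assumption that the unwanted characters are typographic quotation marks (U+2018, U+2019, U+201C, U+201D) in text that has already been decoded to Unicode. The table `TYPOGRAPHIC_TO_ASCII` and the function `asciify_quotes` are hypothetical names, not part of any library.

```python
# -*- coding: utf-8 -*-

# Map code points of typographic quotes to their ASCII look-alikes.
TYPOGRAPHIC_TO_ASCII = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def asciify_quotes(text):
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(TYPOGRAPHIC_TO_ASCII)


print(asciify_quotes(u"Children\u2019s Medical Ventures, LLC"))
```

A function like this could be applied to each cell string before it is stored. Whether replacing the characters is desirable at all depends on whether the downstream consumer really needs ASCII.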

I have tried to make the code below reproducible.

    # import libraries
    import sys
    # import IPython
    print(sys.version_info[0:30])       # Python 2.7.11
    # print(IPython.version_info)       # IPython 4.0.1
    import pandas as pd
    from bs4 import BeautifulSoup
    # from lxml import html
    import requests
    import os
    cwd = os.getcwd()

    # generate dataframe and lists (one list per table column)
    df = pd.DataFrame()
    A = []
    B = []
    C = []
    D = []
    E = []
    F = []

    # scrape the number of separate webpages that contain tables for a given year
    pstr1 = "http://www.fda.gov/safety/recalls/archiverecalls/"
    # for i in range(2006, 2017):
    for i in range(2014, 2015):
        a = ["/default.htm", "/default.htm?page="]
        pagename = pstr1 + str(i) + a[0]
        print(pagename)
        r = requests.get(pagename)
        r.raise_for_status()
        # print(page.encoding)
        r.encoding = 'utf-8'
        page = BeautifulSoup(r.text)
        npages = page.select('.pagination-clean a')

        # scrape data from each table and combine into the dataframe
        for j in range(len(npages)):
            pagename = pstr1 + str(i) + a[1] + str(j+1)
            print(pagename)
            r = requests.get(pagename)
            r.encoding = 'utf-8'
            soup = BeautifulSoup(r.text)
            t1 = soup.find('table')

            for row in t1.find_all("tr"):
                cells = row.find_all('td')

                if len(cells) != 0:  # ignore heading
                    A.append(cells[0].find(string=True))
                    B.append(cells[1].find(string=True))
                    C.append(cells[2].find(string=True))
                    D.append(cells[3].find(string=True))
                    E.append(cells[4].find(string=True))
                    F.append(cells[5].find(string=True))

                    # examine problematic characters
                    try:
                        cells[1].find(string=True).decode('utf-8')
                        # print "string utf-8, length %d bytes" % len(cells[1].find(string=True))
                    except UnicodeError:
                        print("string not utf-8")
                        # print(cells[1].find(string=True))

    # build the final dataframe and write it out as UTF-8
    df = pd.DataFrame(A, columns=['date'])
    df['brand_name'] = B
    df['product_description'] = C
    df['reason_problem'] = D
    df['company'] = E
    df['details_photo'] = F
    df.to_csv(cwd + '/table1.csv', encoding='utf-8')
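If the garbled characters only appear when the CSV is opened in a spreadsheet program, one thing that may help is writing the file with a byte order mark so the reader can detect UTF-8. The sketch below assumes Python 3 and uses only the standard library. `rows_to_csv_bytes` is a hypothetical helper, and whether a given viewer honours the mark is an assumption. The same codec name, `utf-8-sig`, is expected to be accepted by the `encoding` argument of pandas `to_csv`, though that is not exercised here.

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Serialize rows to CSV and encode them as UTF-8 with a leading BOM."""
    text = io.StringIO(newline="")
    csv.writer(text).writerows(rows)
    # The 'utf-8-sig' codec prepends the bytes EF BB BF when encoding
    return text.getvalue().encode("utf-8-sig")


data = rows_to_csv_bytes([["brand_name"], ["Pear\u2019s Gourmet"]])
print(data[:3])   # the byte order mark
```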

