scrape info from web: the .text problem

  • luofeiyu
    New Member
    • Jul 2011
    • 18

    scrape info from web: the .text problem

    Hi everyone, I want to scrape something from
    http://search.dangdang.com/search_pub.php?key=python
    My code is:
    Code:
    import urllib
    import lxml.html
    down='http://search.dangdang.com/search_pub.php?key=python'
    file=urllib.urlopen(down).read()
    root=lxml.html.fromstring(file)
    tnodes = root.xpath("//div[@class='listitem detail']//li[@class='maintitle']//a")
    for i,x in enumerate(tnodes):
       print i,"  ",x.get('name'),x.get('href'),x.get('onclick'),x.text,"\n"
    the output is :
    0 p_name http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','',''); None

    1 p_name http://product.dangdang.com/product.aspx?product_id=20255354&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20255354_2_12605_p','','',''); None

    2 p_name http://product.dangdang.com/product.aspx?product_id=20836565&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20836565_3_2361_p','','',''); None

    3 p_name http://product.dangdang.com/product.aspx?product_id=21004615&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21004615_4_3387_p','','',''); None

    4 p_name http://product.dangdang.com/product.aspx?product_id=21063086&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21063086_5_18815_p','','',''); None

    5 pr_name http://product.dangdang.com/product.aspx?product_id=20678461&ref=search-1-pub s('click','python','01.54.04.03,01.54.06.18','','86_1_25','','','20678461_6_3967_p','','','RECO'); None

    6 pr_name http://product.dangdang.com/product.aspx?product_id=20650363&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','20650363_7_62_p','','','RECO'); 黑客之道:漏洞发掘的艺术(原书第二版)(赠1CD)(电子制品CD-ROM)(

    7 pr_name http://product.dangdang.com/product.aspx?product_id=20767932&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','20767932_8_4475_p','','','RECO'); Binary Hacks――黑客秘笈100选

    8 p_name http://product.dangdang.com/product.aspx?product_id=20596189&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20596189_9_639_p','','',''); None

    9 p_name http://product.dangdang.com/product.aspx?product_id=20947680&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','86_1_25','','','20947680_10_7295_p','','',''); None

    10 p_name http://product.dangdang.com/product.aspx?product_id=21050368&ref=search-1-pub s('click','python','01.54.19.00','','86_1_25','','','21050368_11_7039_p','','',''); None

    11 p_name http://product.dangdang.com/product.aspx?product_id=20667966&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20667966_12_383_p','','',''); None

    12 p_name http://product.dangdang.com/product.aspx?product_id=21022493&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21022493_13_5183_p','','',''); None

    13 pr_name http://product.dangdang.com/product.aspx?product_id=479654&ref=search-1-pub s('click','python','01.54.06.08,01.54.06.18','','86_1_25','','','479654_14_2095_p','','','RECO'); Perl语言编程(第三版)

    14 pr_name http://product.dangdang.com/product.aspx?product_id=20999855&ref=search-1-pub s('click','python','01.54.10.00','','86_1_25','','','20999855_15_6715_p','','','RECO'); 程序员的思维修炼:开发认知潜能的九堂课

    15 pr_name http://product.dangdang.com/product.aspx?product_id=20696203&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','','','20696203_16_31615_p','','','RECO'); Perl语言入门(第五版)(原书名:Learning Perl,5/e)

    16 p_name http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','',''); 可爱的

    17 p_name http://product.dangdang.com/product.aspx?product_id=20362210&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20362210_18_32_p','','',''); 学习

    18 p_name http://product.dangdang.com/product.aspx?product_id=9053236&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','9053236_19_4_p','','',''); 学习

    19 p_name http://product.dangdang.com/product.aspx?product_id=20850780&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','20850780_20_1055_p','','',''); None

    20 pr_name http://product.dangdang.com/product.aspx?product_id=20449068&ref=search-1-pub s('click','python','01.54.06.08','','86_1_25','','','20449068_21_38_p','','','RECO'); 精通Perl

    21 p_name http://product.dangdang.com/product.aspx?product_id=21127816&ref=search-1-pub s('click','python','01.54.24.00,01.54.06.18','','86_1_25','','','21127816_22_12545_p','','',''); None

    22 p_name http://product.dangdang.com/product.aspx?product_id=21107633&ref=search-1-pub s('click','python','01.54.06.18','','86_1_25','','','21107633_23_19245_p','','',''); Hadoop权威指南(第2版) 修订升级版

    23 None http://bang.dangdang.com/product_redirect.php?product_id=9317290 None None

    24 p_name http://product.dangdang.com/product.aspx?product_id=9317290&ref=search-1-pub s('click','python','01.54.06.06,01.49.01.11,01.54.26.00','','86_1_25','','','9317290_24_81727_p','','',''); Java编程思想(第4版)

    25 p_name http://product.dangdang.com/product.aspx?product_id=20773186&ref=search-1-pub s('click','python','01.54.06.17','','86_1_25','','','20773186_25_80479_p','','',''); Android应用开发揭秘

    The problem is x.text. For example:

    1.
    <a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20872365_1_22591_p','','','');">
    <font class="skcolor_ljg">Python</font>
    基础教程(第2版)
    </a>
    What I want to get is "Python 基础教程(第2版)", but the output is None.

    2.
    <a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20670643&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1_25','','','20670643_17_24_p','','','');">
    可爱的
    <font class="skcolor_ljg">Python</font>
    </a>
    What I want to get is "可爱的Python", but the output is only 可爱的.

    Would you mind telling me how to revise my code?
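For anyone hitting the same issue: in the ElementTree model that lxml uses, node.text holds only the text that appears before the element's first child; any text that comes after a child element is stored on that child's .tail attribute. To collect all of the text, use "".join(node.itertext()) or, with lxml, node.text_content(). A minimal sketch using the stdlib ElementTree, which shares the same text/tail model:

```python
# Sketch of the .text vs. itertext() behaviour, using the stdlib
# ElementTree (lxml elements follow the same text/tail model).
import xml.etree.ElementTree as ET

# Case 1 from the post: all the text sits AFTER the <font> child,
# so a1.text is None -- that text lives on the child's .tail instead.
a1 = ET.fromstring('<a><font>Python</font>基础教程(第2版)</a>')
print(a1.text)                  # None
print(''.join(a1.itertext()))   # Python基础教程(第2版)

# Case 2 from the post: only the leading text is in a2.text;
# the rest is inside the <font> child.
a2 = ET.fromstring('<a>可爱的<font>Python</font></a>')
print(a2.text)                  # 可爱的
print(''.join(a2.itertext()))   # 可爱的Python
```

In the original loop that would mean replacing x.text with "".join(x.itertext()).strip(), or, since the tree comes from lxml.html, x.text_content().strip().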
  • milesmajefski
    New Member
    • Aug 2011
    • 10

    #2
    I know nothing about the libraries or the techniques involved here, but I will suggest that maybe those nodes simply don't have any text of their own to show. I looked at the page source your code fetches, and the HTML looks a little strange to me: the anchor tag <a> with class="maintitle" has no closing tag. Instead, a <div class="clear"/> appears, and there was no surrounding div I could match it to. I think your code is fine, but the website is using anchor tags in a very strange (probably invalid) way.
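Lenient HTML parsers are built to survive exactly this kind of malformed markup, which is why the XPath query still finds the nodes. A small illustration with the stdlib html.parser (lxml.html is similarly forgiving; the sample markup below is simplified, not the actual dangdang source):

```python
# A tolerant HTML parser does not choke on an unclosed <a>; it simply
# reports the start/end tag events it actually sees.
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))

p = TagLogger()
# <a> is never closed, just like in the page milesmajefski describes
p.feed('<li class="maintitle"><a href="#">Python<div class="clear"/></li>')
print(p.events)   # note: a 'start' for "a" but no matching 'end'
```

Tree-building parsers such as lxml.html go one step further and insert the missing close tag for you, so the query side never sees the breakage.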


    • dwblas
      Recognized Expert Contributor
      • May 2008
      • 626

      #3
      It appears that you could split on the ")" and parse the next-to-last element if you want to do it by hand. Otherwise, check BeautifulSoup.
      Code:
      test_it = """<a name="p_name" target="_blank" href="http://product.dangdang.com/product.aspx?product_id=20872365&ref=search-1-pub" onclick="s('click','python','01.54.06.18','','86_1 _25','','','20872365_1_22591_p','','','');">
      <font class="skcolor_ljg">Python</font>
      replaced for us latin-1 users)
      </a>"""
      
      ## join the lines into a single string and split on the ")"
      test_list = "".join(test_it.split("\n")).split(")")
      print test_list[-2]
      
      ## select everything between ">" and "<" or the end of the string
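The step hinted at in that last comment (select everything between ">" and "<", or the end of the string) can be done with a regular expression. A sketch in Python 3, with an illustrative fragment string rather than the exact output of the split above:

```python
# Strip the tags from a fragment by splitting on anything of the
# form <...> and joining the remaining text pieces.
import re

fragment = '<font class="skcolor_ljg">Python</font>基础教程(第2版)'
text = ''.join(re.split(r'<[^>]*>', fragment)).strip()
print(text)   # Python基础教程(第2版)
```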
