QUERY: comparing website contents

  • andrewwan1980
    New Member
    • Nov 2007
    • 7

    I've got two websites: one original, the other based on the original.

    I'd like to compare the websites with an automatic diff tool to see what text/information has changed. The problem is that the HTML code and layout have changed drastically, so I can't do a straight text-file compare. What I'm interested in is purely the raw content (paragraphs, sentences, etc.). The original site has no JavaScript, onmouseover hovers, etc. The revamped website has JavaScript, onmouseover hovers, popups, etc.

    How can I create a script (Perl? C++?) that extracts the main text bodies from both sites? I guess I'd also have to specify starting and ending delimiters. Once extracted, the script would need to convert <p></p> paragraph tags and strip out <a onmouseover...> anchor links (while keeping the words in between the anchor tags, of course). The new website uses two spaces after each full stop while the old website uses one. Will this matter?
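
    A minimal sketch of the extraction step in Python, using only the standard library. The delimiter strings (`<!-- begin -->`, `<!-- end -->`) and the names `between`, `TextOnly` and `html_to_text` are made up for illustration; each site would need its own markers. Tags are dropped, the words inside anchors are kept, and `<p>`/`<br>` become paragraph breaks.

```python
from html.parser import HTMLParser


def between(html, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    i = html.index(start) + len(start)
    j = html.index(end, i)
    return html[i:j]


class TextOnly(HTMLParser):
    """Keep text nodes and drop every tag; <p> and <br> become paragraph breaks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip = 0  # depth inside <script>/<style>, whose content is not page text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("p", "br"):
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Anchor text arrives here like any other text, so it is preserved.
        if not self.skip:
            self.parts.append(data)


def html_to_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    # Hypothetical page and delimiters, purely for demonstration.
    page = ('<body><!-- begin --><p>See the <a href="x.htm" '
            'onmouseover="pop()">manual</a> first.</p><!-- end --></body>')
    print(html_to_text(between(page, "<!-- begin -->", "<!-- end -->")))
```

    `between` raises `ValueError` if a delimiter is missing, which is probably what you want in a batch run: a page whose markers have moved should fail loudly rather than produce an empty file.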

    Once we've got the plain text, how do we wrap the paragraphs at 80 characters per line so that we can easily do file compares?
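
    For the wrapping and the one-space/two-space question, a rough sketch with Python's `textwrap` module follows. The function name `normalize` is an assumption of this example. Every run of whitespace inside a paragraph is collapsed to a single space before wrapping, so the spacing after full stops should stop mattering.

```python
import re
import textwrap


def normalize(text, width=80):
    """Collapse whitespace and re-wrap each paragraph to at most `width` columns."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    wrapped = []
    for para in paragraphs:
        # split()/join turns tabs, newlines and repeated spaces into single spaces.
        flat = " ".join(para.split())
        if flat:
            wrapped.append(textwrap.fill(flat, width=width))
    return "\n\n".join(wrapped) + "\n"


if __name__ == "__main__":
    old = "First sentence. Second sentence.\n\nNext paragraph."
    new = "First  sentence.  Second\tsentence.\n\n\n  Next paragraph."
    print(normalize(old) == normalize(new))
```

    Because both sides are flattened and re-wrapped the same way, the original line breaks inside a paragraph no longer influence the diff; only the paragraph boundaries survive.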



    And please do not suggest copying and pasting the text into Notepad or Word. I said 'website', which means dozens of HTML files (probably hundreds). Plus, I'd like a script to automate the compare so I can repeat it in future and remind myself of the diffs.
  • acoder
    Recognized Expert MVP
    • Nov 2006
    • 16032

    #2
    Originally posted by andrewwan1980
    How can I create a script (Perl? C++?)...
    You said it yourself. If you know Perl, I can send this over to the Perl forum. JavaScript can't really do this.


    • andrewwan1980
      New Member
      • Nov 2007
      • 7

      #3
      I need a tool that gets the substring between delimiters, line-wraps the result at 79 characters, and then diffs it, for both oldsite/old1.htm and newsite/new1.htm.

      As for web crawling, the old site is local and the new site is online. But I'd rather hard-code the URLs in a big list (mapping).

      I think I'll use Perl (maybe Python) to:

      1. for each item in the mapping list
      1.1 download the newsite HTML file
      1.2 take the substring of the newsite file using the newsite delimiters
      1.3 take the substring of the oldsite file using the oldsite delimiters
      1.4 run html2txt/hindent on both the oldsite and newsite files, line-wrap at 79 characters, and put the results into 2 separate new folders (diff1, diff2)
      1.5 repeat through the mapping list

      After that I can use Beyond Compare to compare the diff1 and diff2 folders. Hopefully both corresponding text files will be line-wrapped at 79 characters, with whitespace collapsed to a single character (eliminating runs of 2 or more spaces, and tabs). Should carriage returns also be maintained?
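
      The plan above could be sketched in Python roughly as follows. This is only one possible shape, not a tested tool: the names `to_plain`, `fetch_url` and `build_diff_folders` are invented here, the standard-library HTML parser stands in for html2txt/hindent, and the delimiter pairs are whatever markers each site actually has. Both outputs get the same file name so that a folder compare pairs them up.

```python
import re
import textwrap
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


class _Text(HTMLParser):
    """Collect text nodes only; <p>/<br> start a new paragraph."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("p", "br"):
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def to_plain(html, start, end, width=79):
    """Cut out the region between the delimiters, strip tags, wrap paragraphs."""
    i = html.index(start) + len(start)
    body = html[i:html.index(end, i)]
    parser = _Text()
    parser.feed(body)
    parser.close()
    paras = re.split(r"\n\s*\n", "".join(parser.parts))
    # " ".join(p.split()) collapses tabs and repeated spaces to one space.
    wrapped = [textwrap.fill(" ".join(p.split()), width) for p in paras if p.strip()]
    return "\n\n".join(wrapped) + "\n"


def fetch_url(url):
    """Download one page; UTF-8 is an assumption about the site's encoding."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_diff_folders(mapping, old_delims, new_delims, out_dir=".", fetch=fetch_url):
    """mapping: list of (local old file path, new site URL) pairs."""
    out = Path(out_dir)
    for name in ("diff1", "diff2"):
        (out / name).mkdir(parents=True, exist_ok=True)
    for old_path, new_url in mapping:
        # Same output name on both sides so the folder compare lines them up.
        name = Path(old_path).stem + ".txt"
        old_html = Path(old_path).read_text(encoding="utf-8", errors="replace")
        new_html = fetch(new_url)
        (out / "diff1" / name).write_text(to_plain(old_html, *old_delims), encoding="utf-8")
        (out / "diff2" / name).write_text(to_plain(new_html, *new_delims), encoding="utf-8")
```

      A call would look something like `build_diff_folders([("oldsite/old1.htm", new_url)], old_delims, new_delims)`, where `new_url` is the matching page on the new site and each `*_delims` is a `(start, end)` pair of marker strings. Output names come from the old file's stem, so two old files with the same name in different subfolders would overwrite each other; a real run over hundreds of files would need a naming scheme that includes the path.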


      • rnd me
        Recognized Expert Contributor
        • Jun 2007
        • 427

        #4
        If you just want to compare visible contents (not HTML/JS markup changes), I would think comparing the textContent/innerText of the body tag would be easiest.

        I don't understand why this could not be done in JavaScript, but perhaps I misunderstand your question.


        • acoder
          Recognized Expert MVP
          • Nov 2006
          • 16032

          #5
          Originally posted by rnd me
          If you just want to compare visible contents (not HTML/JS markup changes), I would think comparing the textContent/innerText of the body tag would be easiest.

          I don't understand why this could not be done in JavaScript, but perhaps I misunderstand your question.
          Technically, it could be done in JavaScript, but with two domains, some server-side code will have to be involved.
