Teaching a Crawler to Identify a Blog

  • Metropolis

    Teaching a Crawler to Identify a Blog

    Hello All,

    I am currently trying to teach a web crawler how to identify blogs;
    that is, I am trying to determine a fairly inclusive set of criteria
    that will help my crawler to identify them.

    I have noticed that many Blogs include

    div class=blogsomething (a format class conveniently named blog)

    xml tags

    and/or php code.

    I do know that a CMS (content management system) is used for several
    blogs. Does anyone else have any suggestions to help me determine
    criteria?

    I am aware that any criteria are subjective, especially when
    considering sites such as Slashdot, which has been around longer than
    blogs...

    thanks,
    David
  • Chris Hope

    #2
    Re: Teaching a Crawler to Identify a Blog

    Metropolis wrote:
    > I am currently trying to teach a web crawler how to identify blogs,
    > that is I am trying to determine a fairly inclusive set of criteria
    > that will help my crawler to identify them.
    >
    > I have noticed that many Blogs include
    >
    > div class=blogsomething (A format class conveniently named blog)

    Maybe *some* blogs contain this tag, but I'm betting most don't.
    > xml tags

    So do lots of other websites, and I'm betting other websites have 'em more
    than blogs do.
    > and/or php code.

    How can you tell it's a PHP document? You can't see any PHP code because
    what you are served is the generated HTML, not the source. The only hint
    you can have is a URL that ends in .php, but not all PHP pages end in .php

    In any case just because it's PHP doesn't make it a blog.


    I think you'll need to do a lot more than your suggestions here to determine
    if it's a blog or not.

    A lot of them do have date boxes on the page somewhere so you can navigate
    back to previous days' postings. Things like this, and other elements that
    are common to blogs, are what you should be looking for, and not stuff like
    whether it contains XML-style tags or PHP file extensions.
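
One way to look for the date-based navigation mentioned above is to count links whose paths look like dated archives. This is only a rough sketch under stated assumptions: `DATE_PATH`, `LinkCollector`, and `count_date_archive_links` are hypothetical names, and the guess that archives use /YYYY/MM/ or /YYYY/MM/DD/ style URLs will not hold for every blog.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: archive-style paths such as /2004/11/ or /2004/11/24/.
# Many blogs use other URL schemes, so this is one weak signal among several.
DATE_PATH = re.compile(
    r"/(?:19|20)\d{2}/(?:0[1-9]|1[0-2])(?:/(?:0[1-9]|[12]\d|3[01]))?(?:/|$)"
)


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_date_archive_links(html):
    """Return how many links on the page point at date-like archive paths."""
    parser = LinkCollector()
    parser.feed(html)
    return sum(1 for href in parser.hrefs if DATE_PATH.search(href))
```

A count of zero proves nothing, and plenty of news sites use dated URLs too, so a result like this would only be useful when combined with other blog-like elements.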

    --
    Chris Hope - The Electric Toolbox - http://www.electrictoolbox.com/


    • Razzbar

      #3
      Re: Teaching a Crawler to Identify a Blog

      Chris Hope <blackhole@electrictoolbox.com> wrote in message news:<1101283308_7378@216.128.74.129>...
      > Metropolis wrote:
      >
      > > I am currently trying to teach a web crawler how to identify blogs,
      > > that is I am trying to determine a fairly inclusive set of criteria
      > > that will help my crawler to identify them.
      > >
      > > I have noticed that many Blogs include
      > >
      > > div class=blogsomething (A format class conveniently named blog)
      >
      > Maybe *some* blogs contain this tag, but I'm betting most don't.
      >
      > > xml tags
      >
      > So do lots of other websites, and I'm betting other websites have 'em more
      > than blogs do.
      >
      > > and/or php code.
      >
      > How can you tell it's a PHP document? You can't see any PHP code because
      > what you are served up is a static HTML page. The only hint you can have is
      > that the file extension ends with .php but not all PHP pages end in .php
      >
      > In any case just because it's PHP doesn't make it a blog.

      All true.

      Start by thinking about how -you- identify a blog. That ain't easy,
      if my attempts at explaining what a blog is to other people are any
      indication.

      Look for references to time and self. E.g. "yesterday, I"
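
The "time and self" idea could be tried out roughly as follows: split the page text into sentences and count those that mention both a relative time word and a first-person word. Everything here is an assumption made for illustration, including the word lists, the naive sentence splitter, and the hypothetical function name `time_and_self_hits`.

```python
import re

# Assumed, deliberately short word lists; a real crawler would need more.
TIME_RE = re.compile(
    r"\b(?:today|yesterday|tonight|this morning|last night|last week)\b",
    re.IGNORECASE,
)
# "I" is matched case-sensitively so that a lowercase "i" is not counted.
SELF_RE = re.compile(r"\b(?:I|[Mm]y|[Mm]e)\b")

# Naive splitter: break after ., ! or ? when followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def time_and_self_hits(text):
    """Return (hits, total): sentences with both a time word and a
    first-person word, and the total number of sentences seen."""
    sentences = [s for s in SENTENCE_SPLIT.split(text.strip()) if s]
    hits = sum(
        1 for s in sentences if TIME_RE.search(s) and SELF_RE.search(s)
    )
    return hits, len(sentences)
```

For example, `time_and_self_hits("Yesterday, I went to the market. Prices were high.")` returns `(1, 2)`. Forum posts and personal home pages would score well on this too, so it is a hint rather than a test.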

      What IS a blog, anyway?

      Not duck soup, or a piece of cake, this problem.
