Using PHP to calculate MySQL fulltext relevancy scores by hand

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Phil Powell

    Using PHP to calculate MySQL fulltext relevancy scores by hand

    Relevancy scores are normally defined by a MySQL query on a table that
    has a fulltext index. The rules for relevancy scoring will exclude
    certain words due to their being too short (minimum default is 4
    letters).

    This is the Fed. Everything is a TLA (three-letter acronym).

    Therefore, since I'm building a PORTABLE web application, changing
    MySQL's default settings for fulltext index querying is completely
    undoable and unrealistic, so I created a "fake fulltext query", that
    is, a plain query that masquerades as if it were a fulltext index
    (only a lot slower but there is nothing I can do about it).

    However, the client wants the exact same relevancy score as would be
    found in a fulltext query. So I wrote a function that should
    calculate the relevancy score based upon information provided by MySQL
    AB's Sergei Golubchik. Here it is:

    [PHP]
    /**
    * Calculate the relevancy score of a search query if the "SELECT
    MATCH () AGAINST ()" fulltext indexing query is unavailable
    *
    * @access public
    * @param object $result Resultset
    * @param object $row Row of resultset to calculate relevancy
    * @param mixed $keyword
    * @param mixed $tableName name of table to perform counting query
    to calculate $rows value
    * @return float $score
    * @see DBConnection
    * @see MySQLQuery
    * @see http://groups.google.c om/groups?q=mysql+ fulltext+maximu m+score&hl=en&l r=&selm=9tt6u1% 24sca%241%40Fre eBSD.csie.NCTU. edu.tw&rnum=1
    */
    function calculate_relev ancy($result, $row, $keyword, $tableName) {
    /*-------------------------------------------------------------------------------------------------------------------------
    Explanation of integer variables:

    1) $w1 = intermediate weight
    2) $dcount = number of times word is present in document (or query
    row)
    3) $sumw1 = sum of all values of $w1 found
    4) $uniq = number of unique words in the document (or query row)
    5) $rows = total number of rows in the table
    6) $found = number of rows in the table that contain the word in
    question (@sizeof($resul t))
    7) $qcount = number of times this word is present in the query
    -------------------------------------------------------------------------------------------------------------------------*/

    list($w1, $dcount, $sumw1, $uniq, $rows, $found, $qcount) = array(0,
    0, 0, 0, 0, 0, 0);

    $isSetUniq = false;

    for ($i = 0; $i < @sizeof($result ); $i++) {
    foreach (@array_keys(ge t_object_vars($ row)) as $field) {
    $dcount += @substr_count($ result[$i]->$field, $keyword);
    if (!$isSetUniq) $uniq +=
    @sizeof(array_f lip(str_word_co unt($row->$field), 1));
    }
    $w1 = log((float)$dco unt) + 1;
    $sumw1Array[$i] += $w1;
    $qcount += $sumw1Array[$i];
    $dcount = 0;
    $isSetUniq = true;
    }
    $found = @sizeof($result );

    if (!isset($rows)) {
    static $rows = 0;
    global $dbHost, $dbPort, $dbUser, $dbPwd, $dbDefaultName;
    $dbConnObj =& new DBConnection($d bHost, $dbPort, $dbUser, $dbPwd,
    $dbDefaultName) ;
    $dbConn = $dbConnObj->connect();
    $query =& new MySQLQuery('SEL ECT count(r.*) AS rows FROM ' .
    $this->getTempNameNam e() . ' r LIMIT 1', $dbConn);
    $tempTableRowCo untResult = $query->getResult();
    $query = null;
    $dbConnObj->close();
    $dbConn = null;
    $dbConnObj = null;
    $rows = $tempTableRowCo untResult[0]->rows;
    }

    $score = (float)($w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
    log((float)(($r ows - $found) / $found)) * $qcount);
    return $score;

    }

    [/PHP]

    This completely fails because it produces NAN due to $sumw1 always
    equaling INF (infinity). I am not math-savvy enough to know how to
    fix this problem so I appeal to higher intelligence for help on this.

    Thanx
    Phil
  • Justin Koivisto

    #2
    Re: Using PHP to calculate MySQL fulltext relevancy scores by hand

    Phil Powell wrote:
    [color=blue]
    > Relevancy scores are normally defined by a MySQL query on a table that
    > has a fulltext index. The rules for relevancy scoring will exclude
    > certain words due to their being too short (minimum default is 4
    > letters).
    >
    > This is the Fed. Everything is a TLA (three-letter acronym).
    >
    > Therefore, since I'm building a PORTABLE web application, changing
    > MySQL's default settings for fulltext index querying is completely
    > undoable and unrealistic, so I created a "fake fulltext query", that
    > is, a plain query that masquerades as if it were a fulltext index
    > (only a lot slower but there is nothing I can do about it).
    >
    > However, the client wants the exact same relevancy score as would be
    > found in a fulltext query. So I wrote a function that should
    > calculate the relevancy score based upon information provided by MySQL
    > AB's Sergei Golubchik. Here it is:
    >
    > [PHP][/color]

    <snip code>
    [color=blue]
    > $score = (float)($w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
    > log((float)(($r ows - $found) / $found)) * $qcount);[/color]

    try replacing the $sumw1 with:

    array_sum($sumw 1Array)

    --
    Justin Koivisto - spam@koivi.com
    http://www.koivi.com

    Comment

    • Phil Powell

      #3
      Re: Using PHP to calculate MySQL fulltext relevancy scores by hand

      Justin Koivisto <spam@koivi.com > wrote in message news:<VY6ld.154 $SO1.6333@news7 .onvoy.net>...[color=blue]
      > Phil Powell wrote:
      >[color=green]
      > > Relevancy scores are normally defined by a MySQL query on a table that
      > > has a fulltext index. The rules for relevancy scoring will exclude
      > > certain words due to their being too short (minimum default is 4
      > > letters).
      > >
      > > This is the Fed. Everything is a TLA (three-letter acronym).
      > >
      > > Therefore, since I'm building a PORTABLE web application, changing
      > > MySQL's default settings for fulltext index querying is completely
      > > undoable and unrealistic, so I created a "fake fulltext query", that
      > > is, a plain query that masquerades as if it were a fulltext index
      > > (only a lot slower but there is nothing I can do about it).
      > >
      > > However, the client wants the exact same relevancy score as would be
      > > found in a fulltext query. So I wrote a function that should
      > > calculate the relevancy score based upon information provided by MySQL
      > > AB's Sergei Golubchik. Here it is:
      > >
      > > [PHP][/color]
      >
      > <snip code>
      >[color=green]
      > > $score = (float)($w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
      > > log((float)(($r ows - $found) / $found)) * $qcount);[/color]
      >
      > try replacing the $sumw1 with:
      >
      > array_sum($sumw 1Array)[/color]

      I had the formula interpretation wrong. $sumw1 is the sum of all $w1
      values which are the weights of each word that comprises the compound
      keyword, for example, if your keyword is "Hello World" then you have
      $w1 for "Hello" and $w1 for "World".

      Phil

      Comment

      Working...