PHP/MySQL fulltext search relevancy formula update

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Phil Powell

    PHP/MySQL fulltext search relevancy formula update

    The table already has a fulltext index and from there I can use the
    MySQL fulltext search query to get results as well as the relevancy
    score.

    The problem I have is that MySQL has a default setting whereby the
    minimum amount of characters is 4 for a search. Being that we're
    government and full of TLA (three-letter acronyms), that is not
    practical, and furthermore, the app I'm building must be fully
    portable, so having MySQL tweaked everywhere this app goes is a
    useless exercise.

    Hence, for 3-letter words I have to fudge together a relevancy score
    to match that of larger and more commonly accepted words.

    Here is the function I created, calculate_relev ancy(), with notes I
    got from a posting by MySQL AB's Sergei Golubchik on the exact formula
    (according to him):

    [PHP]
    /**
    * Calculate the relevancy score of a search query if the "SELECT
    MATCH () AGAINST ()" fulltext indexing query is unavailable
    *
    * @access public
    * @param mixed $sql SQL query resulting in the resultset used for
    calculating $qcount value
    * @param object $result Resultset
    * @param object $row Row of resultset to calculate relevancy
    * @param mixed $keyword
    * @param int $rows (conditional optional) number of rows in table -
    must exist if $tableName is not passed
    * @param mixed $tableName (optional) name of table to perform
    counting query to calculate $rows value
    * @return float $score
    * @see DBConnection
    * @see MySQLQuery
    * @see http://groups.google.c om/groups?q=mysql+ fulltext+maximu m+score&hl=en&l r=&selm=9tt6u1% 24sca%241%40Fre eBSD.csie.NCTU. edu.tw&rnum=1
    */
    function calculate_relev ancy($sql, $result, $row, $keyword, $rows = 0,
    $tableName = '') {
    /*---------------------------------------------------------------------------------------------------------------------------
    Explanation of integer variables:

    1) $w1 = intermediate weight
    2) $dcount = number of times word is present in document (or query
    row)
    3) $sumw1 = sum of all values of $w1 found
    4) $uniq = number of unique words in the document (or query row)
    5) $rows = total number of rows in the table (as int parameter or
    calculated via $tableName query
    6) $found = number of rows in the table that contain the word in
    question (@sizeof($resul t))
    7) $qcount = number of times this word is present in the query
    statement in $sql parameter
    ----------------------------------------------------------------------------------------------------------------------------*/
    global $section;
    if (!$tableName && (int)$rows === 0) return 0; // TO PREVENT DIVISION
    BY ZERO
    if ((int)$rows == 0 && (!class_exists( 'DBConnection') ||
    !class_exists(' MySQLQuery'))) return 0; // CLASSES MUST EXIST TO
    RE-CALCULATE $rows

    // INITIALIZE VARS, ARRAYS, ETC.
    list($w1, $dcount, $sumw1, $uniq, $found, $qcount) = array(0, 0, 0,
    0, 0, 0);
    $keyword = trim(preg_repla ce('/[\s\t]{2,}/i', ' ', $keyword)); //
    TRIM EXTRA WHITESPACE FOR MORE ACCURATE COUNT
    $keywordArray = explode(' ', $keyword); // CONVERT INTO ARRAY TO
    SEARCH WEIGHT FOR EACH INDIVIDUAL KEYWORD

    if ((int)$rows !== 0 && $rows == $found) return 0; // TO PREVENT
    DIVISION BY ZERO

    // CALCULATE $uniq
    $wordComparator Array = array();
    foreach (@array_keys(ge t_object_vars($ row)) as $field) {
    if ($row->$field && !in_array(strto lower($row->$field),
    $wordComparator Array)) {
    array_push($wor dComparatorArra y, trim(strtolower ($row->$field)));
    $uniq++;
    }
    }

    // CALCULATE $qcount
    foreach ($keywordArray as $word)
    $qcount += @substr_count($ sql, strtolower($wor d)); // NUMBER OF
    INSTANCES OF EACH KEYWORD IN SQL QUERY

    // CALCULATE $sumw1, EACH VALUE OF $dcount, $w1 PER KEYWORD PARTITION
    PER RESULTSET ROW
    @reset($keyword Array); // RESET KEYWORD ARRAY TO ENSURE
    ACCURATE SCAN
    for ($i = 0; $i < @sizeof($result ); $i++) { // SCAN EVERY ROW
    foreach ($keywordArray as $word) { // SCAN EACH COLUMN FIELD
    PER KEYWORD ARRAY OF WORDS
    foreach (@array_keys(ge t_object_vars($ row)) as $field) { // SCAN
    EACH COLUMN FIELD
    if (!$isSetDCount) $dcount +=
    @substr_count(s trtolower($row->$field), strtolower($wor d));
    $isSetDCount = false;
    }
    $w1 = log((float)$dco unt) + 1;
    if (!$isSetSumW1) $sumw1 += $w1;
    $dcount = 0;
    }
    list($isSetDCou nt, $isSetSumW1) = array(true, true);
    }

    // CALCULATE $found
    $found = @sizeof($result );


    // CALCULATE $rows IF NOT PROVIDED AS PARAMETER
    if ((int)$rows === 0) {
    $query = new MySQLQuery("SEL ECT count(*) AS rows FROM $tableName
    LIMIT 1", $dbConn);
    $tempTableRowCo untResult = $query->getResult();
    $query = null;
    $dbConnObj->close();
    $dbConn = null;
    $dbConnObj = null;
    $rows = $tempTableRowCo untResult[0]->rows;
    }

    print_r($row); print_r("<P>");
    print_r("$w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
    log((float)(($r ows - $found) / $found)) * $qcount<P>");
    print_r(($w1 / $sumw1) . ' ' . ($uniq / (1 + 0.0115 * $uniq)) . ' ' .
    (log((float)(($ rows - $found) / $found))) . ' ' . $qcount . "<P>");
    $score = (float)(($w1 / $sumw1) * ($uniq / (1 + 0.0115 * $uniq)) *
    log((float)(($r ows - $found) / $found)) * $qcount);
    return $score;

    }

    [/PHP]

    Problem is that the scores that come back, even if tested for
    4+-letter words, are about 10X the size of an actual relevancy score
    for 4+-letter words (for 3 letter words it is also anywhere between
    7.75 and 30, which I unfortunately cannot compare against with an
    actual relevancy score as a 3-letter word is too small for fulltext
    query in default settings in MySQL 4.0+).

    Just a heads-up, I am not sure what to do at this point.

    Thanx
    Phil
Working...