SP to parse string adding synonimes from table

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • inspireuk
    New Member
    • Jul 2007
    • 12

    SP to parse string adding synonimes from table

    The easiest way for me to explain what I'm trying to do is show an example.

    I have the following tables:

    WORDS
    WordID Word
    1 1,000,000
    2 a million
    3 one million

    WORDLINKS
    WordID WordLinkID
    1 2
    1 3
    2 1
    2 3
    3 1
    3 2

    I would then pass the following sentance into the SP
    a million times a day

    What the SP needs to do is parse the string and find all the phrases within that match the words in the WORDS table. It would then take the WordID and do a lookup in the WORDLINKS table to find all the possible alternatives/synonimes.

    The result would be in the following form:
    {a million|1,000,0 00|one million} times a day

    the matched phrase has been wrapped in {} with each matched alternatives/synonimes seperated by a pipe |.

    there would be multiple phrases in each input string and possibly nested words
    eg in the example above the phrase 'one million', 'one' could have an alternative of '1' in which case the output would be:
    {a million|1,000,0 00|{one|1} million} times a day

    i hope this makes sense. any help would really be appreciated.

    Thanks

    Josh
  • azimmer
    Recognized Expert New Member
    • Jul 2007
    • 200

    #2
    It's going to be a long one... (But a promise is a promise.) What makes it a bit more difficult is the nested replacements (whose proper handling I'm still not 100% sure of):
    • have it greedy or non-greedy (if you have replacements both for 'one million' and 'one' which one has precedence)?
    • If any replacement would result in an expanded string that could be further replaced, what to do? (I.e.: replacements: '1'->'one', 'one million'->'a million', what should the output be for '1 million'? And if replacements are '1'->'one', 'one million'->'1,000,000', what should the output be for '1 million'?)

    The code below is greedy and processes nested replacements only to the length of the replaced term and not including the term being replaced (see examples at the end). It has two functions because the SQL Server wouldn't take the default value (don't know why) and I've had no more time to play around with it. You only need to use the one without the underscore.

    So here goes the code:
    Code:
    create function _syno (@instr as varchar(1000), @term_replace as bit = 0)
    returns varchar(1000)
    as
    begin
    declare @pos as int
    declare @current_str as varchar(1000)
    declare @syno_term as varchar(1000)
    declare @replaced_length as int
    declare @outstr as varchar(1000)
    set @outstr=''
    
    set @pos=1
    while @pos<=len(@instr)
    begin
    	-- try from current position on
    	set @current_str=right(@instr,len(@instr)-@pos+1)
    	declare current_replacements cursor local for
    		-- this is tricky
    		select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
    		from words w, wordlinks l, words w2
    		where w.wordid=l.wordlinkid
    		and (l.wordid=w2.wordid)
    		and charindex(w2.word,@current_str)=1
    		and (@term_replace=0 or (w2.word<>@current_str))
    		order by len(w2.word) desc
    	open current_replacements
    	fetch next from current_replacements into @syno_term, @replaced_length
    	if @@fetch_status = 0	-- there are replacements
    	begin
    		set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
    		-- iterate on current set of replacements (recursivity)
    		while @@fetch_status = 0
    		begin
    			set @syno_term=dbo._syno(@syno_term, 1)
    			set @outstr=@outstr+'|'+@syno_term
    			fetch next from current_replacements into @syno_term, @replaced_length
    		end
    		set @outstr=@outstr+'}'
    		set @pos=@pos+@replaced_length
    	end
    	else
    	begin
    		set @outstr=@outstr+left(@current_str,1)
    		set @pos=@pos+1
    	end
    	deallocate current_replacements
    end
    return @outstr 
    end
    GO
    
    create function syno (@instr as varchar(1000))
    returns varchar(1000)
    as
    begin
    return dbo._syno(@instr,0)
    end
    GO
    Note: I found the easiest way to get the code itself (without line numbers) is to "Reply" to the message and copy-paste whatever is between '[ CODE]' and '[ /CODE]'.

    My WORDS and WORDLINKS tables are as follows:
    Code:
    WordID      Word                                               
    ----------- -------------------
    1           1,000,000
    2           a million
    3           one million
    4           one
    5           1
    
    WordID      WordLinkID  
    ----------- ----------- 
    1           2
    1           3
    2           1
    2           3
    3           1
    3           2
    4           5
    Examples and usage:
    Code:
    print dbo.syno('I hope you run it a million times')
    results: I hope you run it {a million|1,000,0 00|{one|1} million} times
    Code:
    print dbo.syno('I hope you run it one million times')
    results: I hope you run it {one million|1,000,0 00|a million} times
    (note how it will replace 'one million', and will process the 'one' in the 'one million' because that's the whole term to be replaced in the first place)
    Code:
    print dbo.syno('I hope someone runs it one million times')
    results: I hope some{one|1} runs it {one million|1,000,0 00|a million} times

    Hope it helps. I don't say it's impossible to refine it (especially in the handling of nested replacements) but I believe it's a good start.

    P.S.: It, indeed, would have been a rather steep learning curve from a very basic level...
    Last edited by azimmer; Aug 10 '07, 03:17 PM. Reason: copy/paste typo in examples

    Comment

    • inspireuk
      New Member
      • Jul 2007
      • 12

      #3
      This is ammmaazzing! I can't thank you enough :)

      However :) just a couple more things if you have the time. Would it be possible only to match full words and with regards to the nested replacements i think i may just programmically recall the sp for small parts if needed so no recursion needed (sorry).

      the precedence i think should be for the longer phrase.

      let me know if there is anything else.

      Josh

      Comment

      • inspireuk
        New Member
        • Jul 2007
        • 12

        #4
        so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

        Code:
        ALTER PROCEDURE ParseString
        	@instr as varchar(max)
        AS
        
        	declare @pos as int
        	declare @current_str as varchar(max)
        	declare @syno_term as varchar(max)
        	declare @replaced_length as int
        	declare @outstr as varchar(max)
        
        	set @outstr = ''
        	set @pos=1
        	while @pos<=len(@instr)
        	begin
        		-- try from current position on
        		set @current_str=right(@instr,len(@instr)-@pos+1)
        		declare current_replacements cursor local for
        			-- this is tricky
        			select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
        			from words w, LinkedWords l, words w2
        			where w.wordid=l.LinkedWordID
        			and (l.wordid=w2.wordid)
        			and charindex(w2.word,@current_str)=1
        			and right(left(@current_str, len(w2.word)+1),1) = ' '
        			order by len(w2.word) desc
        		open current_replacements
        		fetch next from current_replacements into @syno_term, @replaced_length
        		if @@fetch_status = 0	-- there are replacements
        		begin
        			set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
        			-- iterate on current set of replacements (recursivity)
        			while @@fetch_status = 0
        			begin
        				--set @syno_term=ammuser._syno(@syno_term, 1)
        				set @outstr=@outstr+'|'+@syno_term
        				fetch next from current_replacements into @syno_term, @replaced_length
        			end
        			set @outstr=@outstr+'} '
        			set @pos=@pos+@replaced_length+1
        		end
        		else
        		begin
        			if charindex(' ',@current_str) = 0
        				begin
        					set @outstr=@outstr+@current_str
        					set @pos=@pos+len(@current_str)				
        				end
        			else
        				begin
        					set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
        					set @pos=@pos+charindex(' ',@current_str)
        				end
        		end
        		deallocate current_replacements
        	end
        	SELECT @outstr

        Comment

        • azimmer
          Recognized Expert New Member
          • Jul 2007
          • 200

          #5
          Originally posted by inspireuk
          so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

          Code:
          ALTER PROCEDURE ParseString
          	@instr as varchar(max)
          AS
          
          	declare @pos as int
          	declare @current_str as varchar(max)
          	declare @syno_term as varchar(max)
          	declare @replaced_length as int
          	declare @outstr as varchar(max)
          
          	set @outstr = ''
          	set @pos=1
          	while @pos<=len(@instr)
          	begin
          		-- try from current position on
          		set @current_str=right(@instr,len(@instr)-@pos+1)
          		declare current_replacements cursor local for
          			-- this is tricky
          			select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
          			from words w, LinkedWords l, words w2
          			where w.wordid=l.LinkedWordID
          			and (l.wordid=w2.wordid)
          			and charindex(w2.word,@current_str)=1
          			and right(left(@current_str, len(w2.word)+1),1) = ' '
          			order by len(w2.word) desc
          		open current_replacements
          		fetch next from current_replacements into @syno_term, @replaced_length
          		if @@fetch_status = 0	-- there are replacements
          		begin
          			set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
          			-- iterate on current set of replacements (recursivity)
          			while @@fetch_status = 0
          			begin
          				--set @syno_term=ammuser._syno(@syno_term, 1)
          				set @outstr=@outstr+'|'+@syno_term
          				fetch next from current_replacements into @syno_term, @replaced_length
          			end
          			set @outstr=@outstr+'} '
          			set @pos=@pos+@replaced_length+1
          		end
          		else
          		begin
          			if charindex(' ',@current_str) = 0
          				begin
          					set @outstr=@outstr+@current_str
          					set @pos=@pos+len(@current_str)				
          				end
          			else
          				begin
          					set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
          					set @pos=@pos+charindex(' ',@current_str)
          				end
          		end
          		deallocate current_replacements
          	end
          	SELECT @outstr
          Is good that you have the ambition and time to play around with it as I have very little time at the moment. Still I've been thinking about whole word replacements. One of the tough issues is that you need to have a word terminator at both ends of the term to be replaced - and word terminators come in many forms (not only space, but period, comma, semicolon, dash, new line, tab, etc as well - perhaps whatever is not a letter (but then what about '@' and digits?).) It should somehow be taken care of in the cursor select to make sure no false replacements are signaled (an that's not easy). Alternatively it may be possible to modify the select so that it sometimes gives false possibilities and then check and disregard them (but that's ain't easy either).

          Slowly we're venturing into the realms of grep and awk - implemented in a SQL Server SP :-) (Just kidding; we're miles away.)

          Comment

          Working...