SP to parse string adding synonimes from table

**azimmer** · Aug 10 '07, 03:07 PM

It's going to be a long one... (But a promise is a promise.) What makes it a bit more difficult is the nested replacements (whose proper handling I'm still not 100% sure of):

have it greedy or non-greedy (if you have replacements both for 'one million' and 'one' which one has precedence)?
If any replacement would result in an expanded string that could be further replaced, what to do? (I.e.: replacements: '1'->'one', 'one million'->'a million', what should the output be for '1 million'? And if replacements are '1'->'one', 'one million'->'1,000,000', what should the output be for '1 million'?)

The code below is greedy and processes nested replacements only to the length of the replaced term and not including the term being replaced (see examples at the end). It has two functions because the SQL Server wouldn't take the default value (don't know why) and I've had no more time to play around with it. You only need to use the one without the underscore.

So here goes the code:

Code:

create function _syno (@instr as varchar(1000), @term_replace as bit = 0)
returns varchar(1000)
as
begin
declare @pos as int
declare @current_str as varchar(1000)
declare @syno_term as varchar(1000)
declare @replaced_length as int
declare @outstr as varchar(1000)
set @outstr=''

set @pos=1
while @pos<=len(@instr)
begin
	-- try from current position on
	set @current_str=right(@instr,len(@instr)-@pos+1)
	declare current_replacements cursor local for
		-- this is tricky
		select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
		from words w, wordlinks l, words w2
		where w.wordid=l.wordlinkid
		and (l.wordid=w2.wordid)
		and charindex(w2.word,@current_str)=1
		and (@term_replace=0 or (w2.word<>@current_str))
		order by len(w2.word) desc
	open current_replacements
	fetch next from current_replacements into @syno_term, @replaced_length
	if @@fetch_status = 0	-- there are replacements
	begin
		set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
		-- iterate on current set of replacements (recursivity)
		while @@fetch_status = 0
		begin
			set @syno_term=dbo._syno(@syno_term, 1)
			set @outstr=@outstr+'|'+@syno_term
			fetch next from current_replacements into @syno_term, @replaced_length
		end
		set @outstr=@outstr+'}'
		set @pos=@pos+@replaced_length
	end
	else
	begin
		set @outstr=@outstr+left(@current_str,1)
		set @pos=@pos+1
	end
	deallocate current_replacements
end
return @outstr 
end
GO

create function syno (@instr as varchar(1000))
returns varchar(1000)
as
begin
return dbo._syno(@instr,0)
end
GO

Note: I found the easiest way to get the code itself (without line numbers) is to "Reply" to the message and copy-paste whatever is between '[ CODE]' and '[ /CODE]'.

My WORDS and WORDLINKS tables are as follows:

Code:

WordID      Word                                               
----------- -------------------
1           1,000,000
2           a million
3           one million
4           one
5           1

WordID      WordLinkID  
----------- ----------- 
1           2
1           3
2           1
2           3
3           1
3           2
4           5

Examples and usage:

Code:

print dbo.syno('I hope you run it a million times')

results: I hope you run it {a million|1,000,0 00|{one|1} million} times

Code:

print dbo.syno('I hope you run it one million times')

results: I hope you run it {one million|1,000,0 00|a million} times
(note how it will replace 'one million', and will process the 'one' in the 'one million' because that's the whole term to be replaced in the first place)

Code:

print dbo.syno('I hope someone runs it one million times')

results: I hope some{one|1} runs it {one million|1,000,0 00|a million} times

Hope it helps. I don't say it's impossible to refine it (especially in the handling of nested replacements) but I believe it's a good start.

P.S.: It, indeed, would have been a rather steep learning curve from a very basic level...

**inspireuk** · Aug 10 '07, 06:18 PM

This is ammmaazzing! I can't thank you enough :)

However :) just a couple more things if you have the time. Would it be possible only to match full words and with regards to the nested replacements i think i may just programmically recall the sp for small parts if needed so no recursion needed (sorry).

the precedence i think should be for the longer phrase.

let me know if there is anything else.

Josh

**inspireuk** · Aug 13 '07, 07:53 PM

so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

Code:

ALTER PROCEDURE ParseString
	@instr as varchar(max)
AS

	declare @pos as int
	declare @current_str as varchar(max)
	declare @syno_term as varchar(max)
	declare @replaced_length as int
	declare @outstr as varchar(max)

	set @outstr = ''
	set @pos=1
	while @pos<=len(@instr)
	begin
		-- try from current position on
		set @current_str=right(@instr,len(@instr)-@pos+1)
		declare current_replacements cursor local for
			-- this is tricky
			select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
			from words w, LinkedWords l, words w2
			where w.wordid=l.LinkedWordID
			and (l.wordid=w2.wordid)
			and charindex(w2.word,@current_str)=1
			and right(left(@current_str, len(w2.word)+1),1) = ' '
			order by len(w2.word) desc
		open current_replacements
		fetch next from current_replacements into @syno_term, @replaced_length
		if @@fetch_status = 0	-- there are replacements
		begin
			set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
			-- iterate on current set of replacements (recursivity)
			while @@fetch_status = 0
			begin
				--set @syno_term=ammuser._syno(@syno_term, 1)
				set @outstr=@outstr+'|'+@syno_term
				fetch next from current_replacements into @syno_term, @replaced_length
			end
			set @outstr=@outstr+'} '
			set @pos=@pos+@replaced_length+1
		end
		else
		begin
			if charindex(' ',@current_str) = 0
				begin
					set @outstr=@outstr+@current_str
					set @pos=@pos+len(@current_str)				
				end
			else
				begin
					set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
					set @pos=@pos+charindex(' ',@current_str)
				end
		end
		deallocate current_replacements
	end
	SELECT @outstr

**azimmer** · Aug 14 '07, 08:50 AM

Originally posted by inspireuk

so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

Code:

ALTER PROCEDURE ParseString
	@instr as varchar(max)
AS

	declare @pos as int
	declare @current_str as varchar(max)
	declare @syno_term as varchar(max)
	declare @replaced_length as int
	declare @outstr as varchar(max)

	set @outstr = ''
	set @pos=1
	while @pos<=len(@instr)
	begin
		-- try from current position on
		set @current_str=right(@instr,len(@instr)-@pos+1)
		declare current_replacements cursor local for
			-- this is tricky
			select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
			from words w, LinkedWords l, words w2
			where w.wordid=l.LinkedWordID
			and (l.wordid=w2.wordid)
			and charindex(w2.word,@current_str)=1
			and right(left(@current_str, len(w2.word)+1),1) = ' '
			order by len(w2.word) desc
		open current_replacements
		fetch next from current_replacements into @syno_term, @replaced_length
		if @@fetch_status = 0	-- there are replacements
		begin
			set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
			-- iterate on current set of replacements (recursivity)
			while @@fetch_status = 0
			begin
				--set @syno_term=ammuser._syno(@syno_term, 1)
				set @outstr=@outstr+'|'+@syno_term
				fetch next from current_replacements into @syno_term, @replaced_length
			end
			set @outstr=@outstr+'} '
			set @pos=@pos+@replaced_length+1
		end
		else
		begin
			if charindex(' ',@current_str) = 0
				begin
					set @outstr=@outstr+@current_str
					set @pos=@pos+len(@current_str)				
				end
			else
				begin
					set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
					set @pos=@pos+charindex(' ',@current_str)
				end
		end
		deallocate current_replacements
	end
	SELECT @outstr

Is good that you have the ambition and time to play around with it as I have very little time at the moment. Still I've been thinking about whole word replacements. One of the tough issues is that you need to have a word terminator at both ends of the term to be replaced - and word terminators come in many forms (not only space, but period, comma, semicolon, dash, new line, tab, etc as well - perhaps whatever is not a letter (but then what about '@' and digits?).) It should somehow be taken care of in the cursor select to make sure no false replacements are signaled (an that's not easy). Alternatively it may be possible to modify the select so that it sometimes gives false possibilities and then check and disregard them (but that's ain't easy either).

Slowly we're venturing into the realms of grep and awk - implemented in a SQL Server SP :-) (Just kidding; we're miles away.)

SP to parse string adding synonimes from table

SP to parse string adding synonimes from table

Comment

Comment

Comment

Comment