Hello,
I am importing data that lists rates for particular coverages for a
particular period of time. Unfortunately, the data source isn't very
clean. I've come up with some rules that I think will work to clean the
data, but I'm having trouble putting those rules into efficient SQL.
The table that I'm dealing with has just under 9M rows and I may need
to use similar logic on an even larger table, so I'd like something
that can be made efficient to some degree using indexes if necessary.
Here is some sample (simplified) code:
CREATE TABLE Coverage_Rates (
rate_id INT IDENTITY NOT NULL,
coverage_id INT NOT NULL,
start_date SMALLDATETIME NOT NULL,
end_date SMALLDATETIME NOT NULL,
rate MONEY NOT NULL )
GO
INSERT INTO Coverage_Rates VALUES (1, '2004-01-01', '2004-06-01',
40.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-03-01', '2004-08-01',
20.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-06-01', '2004-08-01',
30.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-01-01', '9999-12-31',
90.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-03-01', '2004-08-01',
20.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-08-01', '2004-08-01',
30.00)
GO
The rule is basically this... for any given period of time, for a
particular coverage, always use the coverage with the highest rate. So,
given the rows above, I would want the results to be:
coverage_id start_dt end_dt rate
----------- ---------- ---------- --------
1 2004-01-01 2004-06-01 40.00
1 2004-06-01 2004-08-01 30.00
2 2004-01-01 9999-12-31 90.00
There can be any combination of start and end dates in the source, but
in my final results I would like to be able to have only one distinct
row for any given time and coverage ID. So, given any date @my_date,
SELECT coverage_id, COUNT(*)
FROM <results>
WHERE @my_date >= start_dt
AND @my_date < end_dt
GROUP BY coverage_id
HAVING COUNT(*) > 1
the above query should return 0 rows.
Thanks for any help!
-Tom.
I am importing data that lists rates for particular coverages for a
particular period of time. Unfortunately, the data source isn't very
clean. I've come up with some rules that I think will work to clean the
data, but I'm having trouble putting those rules into efficient SQL.
The table that I'm dealing with has just under 9M rows and I may need
to use similar logic on an even larger table, so I'd like something
that can be made efficient to some degree using indexes if necessary.
Here is some sample (simplified) code:
CREATE TABLE Coverage_Rates (
rate_id INT IDENTITY NOT NULL,
coverage_id INT NOT NULL,
start_date SMALLDATETIME NOT NULL,
end_date SMALLDATETIME NOT NULL,
rate MONEY NOT NULL )
GO
INSERT INTO Coverage_Rates VALUES (1, '2004-01-01', '2004-06-01',
40.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-03-01', '2004-08-01',
20.00)
INSERT INTO Coverage_Rates VALUES (1, '2004-06-01', '2004-08-01',
30.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-01-01', '9999-12-31',
90.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-03-01', '2004-08-01',
20.00)
INSERT INTO Coverage_Rates VALUES (2, '2004-08-01', '2004-08-01',
30.00)
GO
The rule is basically this... for any given period of time, for a
particular coverage, always use the coverage with the highest rate. So,
given the rows above, I would want the results to be:
coverage_id start_dt end_dt rate
----------- ---------- ---------- --------
1 2004-01-01 2004-06-01 40.00
1 2004-06-01 2004-08-01 30.00
2 2004-01-01 9999-12-31 90.00
There can be any combination of start and end dates in the source, but
in my final results I would like to be able to have only one distinct
row for any given time and coverage ID. So, given any date @my_date,
SELECT coverage_id, COUNT(*)
FROM <results>
WHERE @my_date >= start_dt
AND @my_date < end_dt
GROUP BY coverage_id
HAVING COUNT(*) > 1
the above query should return 0 rows.
Thanks for any help!
-Tom.
Comment