Cron job to remove logically redundant entries in Postgres SQL

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Abhilash28coder
    New Member
    • Jun 2019
    • 1

    Cron job to remove logically redundant entries in Postgres SQL

    I have a requirement to delete records from a Postgres SQL table which has more than 200 million records. The table is not having any primary key.

    The sample table (Bookmark is the name of table) content is as below:

    Code:
        systemId	 filename		    mindatetime					maxdatetime
          70277		monitor_1.dat	2019-04-21 08:00:00 AM		2019-04-21 03:10:00 PM
          10006		monitor_2.dat	2019-04-25 10:00:00 AM		2019-04-25 11:30:00 AM
          10006		monitor_3.dat	2019-04-28 08:00:00 AM		2019-04-28 10:00:00 AM
          10006		monitor_3.dat	2019-04-28 09:00:00 AM		2019-04-28 11:00:00 AM
          10006		monitor_3.dat	2019-04-28 07:00:00 AM		2019-04-28 04:00:00 PM
          8368		monitor_1.dat	2019-05-21 11:00:00 AM		2019-05-21 11:30:00 AM
          8368		monitor_7.dat	2019-05-21 06:00:00 AM		2019-05-21 11:00:00 AM
          8368		monitor_5.dat	2019-05-23 08:00:00 AM		2019-05-23 10:00:00 AM

    The cron job should run on a given schedule to delete the records which are logically redundant.

    To explain this let's take the case of systemId '10006' where filename is 'monitor_3.dat' having 3 entries with min and max date timestamp of the same day.

    Logically we can delete the entries having mindatetime 08:00:00 AM and 09:00:00 AM, maxdatetime 10:00:00 AM, 11:00:00 AM as that interval is being covered by the other entry which has mindatetime as 7 AM and maxdatetime as 4 PM.

    So those entries would fall under this interval and the job should identify such entries in the entire table and delete them.

    My resultant output table content in this case should be:

    Code:
        systemId	 filename		    mindatetime					maxdatetime
          70277		monitor_1.dat	2019-04-21 08:00:00 AM		2019-04-21 03:10:00 PM
          10006		monitor_2.dat	2019-04-25 10:00:00 AM		2019-04-25 11:30:00 AM
          10006		monitor_3.dat	2019-04-28 07:00:00 AM		2019-04-28 04:00:00 PM
          8368		monitor_1.dat	2019-05-21 11:00:00 AM		2019-05-21 11:30:00 AM
          8368		monitor_7.dat	2019-05-21 06:00:00 AM		2019-05-21 11:00:00 AM
          8368		monitor_5.dat	2019-05-23 08:00:00 AM		2019-05-23 10:00:00 AM
    The table size is more than 20Gb on disk so I was exploring writing a sql procedure or job to achieve this but not able to make much progress. Any ideas or suggestions for overcoming this complex scenario?
    Last edited by zmbd; Jun 14 '19, 03:41 PM. Reason: [z{Added [CODE/] format to hold the example table spacing :-) }]
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    #2
    What happens if they only partially overlap? What happens if they fully overlap but over two different entries? What happens if they overlap but the start and end are the same? You need to clearly define the requirements otherwise you're going to run into trouble down the road.

    Whatever the case may be, the answer will probably be to join the table to itself to find overlapping entries. How you formulate that join will depend on what you need to happen in the scenarios above.

    Comment

    Working...