DB failure?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Wes Palmer

    DB failure?

    PostgreSQL: 7.4.1

    Last week, I had a corrupt index on one table with 2 million rows. On a
    specific search, the database would SEGV. I dropped and recreated the index
    involved in the search, and did a REINDEX on the primary key. That problem
    went away.

    Now I'm seeing:

    db=> select count(*) from messages;
    ERROR: could not access status of transaction 859000513
    DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
    directory

    db=> select count(*) from message_recipie nts;
    ERROR: invalid page header in block 1238604 of relation
    "message_recipi ents"


    The above commands were successful on 8/21.

    There are 240 million rows. Dump/reload is not something that would be an
    attractive option. Based on previous timings, it could take 2-3 days (this
    is with a dual hyper-threaded 2.4 Ghz with 2GB memory and an 8 drive RAID 5)
    on a production system.

    So far, this database has been all INSERTs - no deletes or updates.

    Is there a way to recover this without a dump/reload? We do nightly
    backups, but since we don't know when the problem really started, it would
    be rather difficult to restore and reapply several million updates.

    Wes


    ---------------------------(end of broadcast)---------------------------
    TIP 7: don't forget to increase your free space map settings

  • Wes

    #2
    Re: DB failure?

    On 8/30/04 11:07 PM, "Wes Palmer" <wespvp@syntegr a.com> wrote:
    [color=blue]
    > db=> select count(*) from messages;
    > ERROR: could not access status of transaction 859000513
    > DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
    > directory
    >
    > db=> select count(*) from message_recipie nts;
    > ERROR: invalid page header in block 1238604 of relation
    > "message_recipi ents"[/color]

    Uh, oh.. This would appear to be a big problem... I just tried to do a
    pg_dumpall. The server SEGV'd around a Gig into the pg_dumpall:

    LOG: server process (PID 12541) was terminated by signal 11
    LOG: terminating any other active server processes
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the
    current transaction and exit, because another server process exited
    abnormally and possibly corrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and
    repeat your command.
    LOG: all server processes terminated; reinitializing
    LOG: database system was interrupted at 2004-08-31 07:13:31 MEST
    LOG: checkpoint record is at 6E/13A916C
    LOG: redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
    LOG: next transaction ID: 173895; next OID: 243689524
    LOG: database system was not properly shut down; automatic recovery in
    progress
    LOG: record with zero length at 6E/13A91AC
    LOG: redo is not required
    LOG: recycled transaction log file "0000006E000000 00"
    LOG: database system is ready



    ---------------------------(end of broadcast)---------------------------
    TIP 8: explain analyze is your friend

    Comment

    • Martijn van Oosterhout

      #3
      Re: DB failure?

      Time to test your memory and harddisk....

      On Tue, Aug 31, 2004 at 12:24:53AM -0500, Wes wrote:[color=blue]
      > On 8/30/04 11:07 PM, "Wes Palmer" <wespvp@syntegr a.com> wrote:
      > [color=green]
      > > db=> select count(*) from messages;
      > > ERROR: could not access status of transaction 859000513
      > > DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
      > > directory
      > >
      > > db=> select count(*) from message_recipie nts;
      > > ERROR: invalid page header in block 1238604 of relation
      > > "message_recipi ents"[/color]
      >
      > Uh, oh.. This would appear to be a big problem... I just tried to do a
      > pg_dumpall. The server SEGV'd around a Gig into the pg_dumpall:
      >
      > LOG: server process (PID 12541) was terminated by signal 11
      > LOG: terminating any other active server processes
      > WARNING: terminating connection because of crash of another server process
      > DETAIL: The postmaster has commanded this server process to roll back the
      > current transaction and exit, because another server process exited
      > abnormally and possibly corrupted shared memory.
      > HINT: In a moment you should be able to reconnect to the database and
      > repeat your command.
      > LOG: all server processes terminated; reinitializing
      > LOG: database system was interrupted at 2004-08-31 07:13:31 MEST
      > LOG: checkpoint record is at 6E/13A916C
      > LOG: redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
      > LOG: next transaction ID: 173895; next OID: 243689524
      > LOG: database system was not properly shut down; automatic recovery in
      > progress
      > LOG: record with zero length at 6E/13A91AC
      > LOG: redo is not required
      > LOG: recycled transaction log file "0000006E000000 00"
      > LOG: database system is ready
      >
      >
      >
      > ---------------------------(end of broadcast)---------------------------
      > TIP 8: explain analyze is your friend[/color]

      --
      Martijn van Oosterhout <kleptog@svana. org> http://svana.org/kleptog/[color=blue]
      > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
      > tool for doing 5% of the work and then sitting around waiting for someone
      > else to do the other 95% so you can sue them.[/color]

      -----BEGIN PGP SIGNATURE-----
      Version: GnuPG v1.0.6 (GNU/Linux)
      Comment: For info see http://www.gnupg.org

      iD8DBQFBNEEPY5T wig3Ge+YRAny9AJ 90y4XCC7XQhnzuy dKzIc1czTocyQCg 0HIJ
      jGX3OCOTrU1JX0I qmBOkrrw=
      =7209
      -----END PGP SIGNATURE-----

      Comment

      • Chester Kustarz

        #4
        Re: DB failure?

        On Mon, 30 Aug 2004, Wes Palmer wrote:[color=blue]
        > PostgreSQL: 7.4.1
        >
        > Last week, I had a corrupt index on one table with 2 million rows. On a
        > specific search, the database would SEGV. I dropped and recreated the index
        > involved in the search, and did a REINDEX on the primary key. That problem
        > went away.
        >
        > Now I'm seeing:
        >
        > db=> select count(*) from messages;
        > ERROR: could not access status of transaction 859000513
        > DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
        > directory[/color]

        I saw the above 2 types of errors (transaction status, segv) yesterday on
        a box that turned out to have issues writing files correctly to disk. I
        wrote a tool to write a large file to disk and then reopened the file
        to read/verify the contents and it would fail every so often (like 1-5% of
        the time.) What puzzles me is that the machine would work at all with an
        issue like that. Turning on/off the battery backed cache had no effect.

        The machine has ECC memory, but I tested that as well but it turned up
        nothing. I am using PostgreSQL 7.4.1.



        ---------------------------(end of broadcast)---------------------------
        TIP 7: don't forget to increase your free space map settings

        Comment

        Working...