Very strange IO problems

  • jfixsen

    Very strange IO problems

    Hello!

    Oracle 9.2.0.4
    SunOS pchi-db01 5.8 Generic_108528-19 sun4u sparc
    SUNW,Ultra-Enterprise
    System = SunOS
    Node = pchi-db01
    Release = 5.8
    KernelID = Generic_108528-19
    Machine = sun4u
    BusType = <unknown>
    Serial = <unknown>
    Users = <unknown>
    OEM# = 0
    Origin# = 1
    NumCPU = 8

    History: we had been using a SAN disk array for storage and then
    switched over to a Sun T3. About a week after moving to the T3, I saw
    the following message in my alert log: WARNING: aiowait timed out 1
    times. No performance problems happened until about a month after
    that.

    Problem: We started seeing huge performance problems from out of
    nowhere on December 16 on everything from big batch jobs (heavy FTS)
    to simply logging in, and I started seeing several of these aiowait
    messages each day, sometimes up to 20. No application changes were
    made at any time during any of this, and after the performance
    problems started, I even cut the load, mostly big FTS jobs, way back.

    I had been running statspack every day the entire time, and on the day
    the performance problems hit hard (about 5 weeks after going to the
    T3), the noticeable differences I saw in the statspack reports were
    all related to writes, from what I could tell.

    BEFORE:

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~
                                                                      % Total
    Event                                        Waits       Time (s) Ela Time
    -------------------------------------------- ----------- -------- --------
    CPU time                                              43,689,174    92.51
    db file scattered read                       131,668,468  949,948     2.01
    PX Deq: Execute Reply                            931,750  496,692     1.05
    direct path read                              73,177,620  489,356     1.04
    PX Deq Credit: send blkd                      24,148,414  425,685      .90

    AFTER:

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~
                                                                      % Total
    Event                                        Waits       Time (s) Ela Time
    -------------------------------------------- ----------- -------- --------
    log file sync                                    874,418  604,164    32.92
    direct path write                                121,724  233,840    12.74
    PX Deq Credit: send blkd                       1,103,485  212,285    11.57
    db file scattered read                        12,039,568  165,860     9.04
    log buffer space                                 158,742  127,009     6.92

    BEFORE:

                                                              Avg
                                                 Total Wait   wait   Waits
    Event                        Waits  Timeouts   Time (s)   (ms)    /txn
    ---------------------------- ------------ ---------- ---------- ------ ------
    db file scattered read        131,668,468          0    949,948      7   60.4
    PX Deq: Execute Reply             931,750    211,623    496,692    533    0.4
    direct path read               73,177,620          0    489,356      7   33.5
    PX Deq Credit: send blkd       24,148,414    122,149    425,685     18   11.1
    db file sequential read        31,794,392          0    349,188     11   14.6
    direct path write               2,652,105          0    309,880    117    1.2
    log file sync                   3,021,375    104,267    201,582     67    1.4
    db file parallel write            547,546    254,564     68,136    124    0.3
    enqueue                            27,670     14,655     50,246   1816    0.0
    log buffer space                  110,172     32,943     47,180    428    0.1

    AFTER:

                                                              Avg
                                                 Total Wait   wait   Waits
    Event                        Waits  Timeouts   Time (s)   (ms)    /txn
    ---------------------------- ------------ ---------- ---------- ------ ------
    log file sync                     874,418    534,000    604,164    691    3.4
    direct path write                 121,724          0    233,840   1921    0.5
    PX Deq Credit: send blkd        1,103,485     94,421    212,285    192    4.3
    db file scattered read         12,039,568          0    165,860     14   46.5
    log buffer space                  158,742    120,911    127,009    800    0.6
    PX Deq: Execute Reply             742,346     61,708    126,984    171    2.9


    I am about to call Sun, but I want to be sure it's not something I'm
    overlooking. We've got all the latest Sun patches, and one post I saw
    about the aiowait message said to change /etc/system to include the
    following parm after installing the patch (which we already had):

    * this parm is associated with the aiowait errors that were corrected
    * in patch 112255-01, Solaris 8
    set TS:ts_sleep_promote=1

    I haven't seen the aiowait message in the alert log since the parm was
    added and the db restarted, but performance still sucks and the
    statspack numbers are still bad. Most noticeably, when I try logging
    into sqlplus, it hangs for 10-60 seconds. If I look at the wait
    events from another session, it always waits on log buffer space
    initially, and then log file sync.

    To me, this all smells like a Sun problem, but I would love your
    opinions before I call.

    Thanks

    Jason
    jfixsen@nospam_virtumundo.com
  • craig

    #2
    Re: Very strange IO problems

    I am a relative amateur on the subject, but I have been reading the book
    Effective Oracle by Design by Thomas Kyte (p. 151). He makes mention of
    direct path writes and db file scattered reads. Direct path write waits
    come from delays writing to TEMP tablespaces (caused by direct loads,
    parallel DML updates, and uncached LOBs). In his example this was caused
    by a PGA_AGGREGATE_TARGET set too low, so he increased its size
    significantly. You might want to investigate this area.
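    Not from the book, but a quick way to check that area on 9i is to look
    at v$pgastat and the workarea statistics; a nonzero over-allocation
    count, or one-pass/multi-pass executions, would back up the
    too-small-target theory:

    ```sql
    -- Is PGA_AGGREGATE_TARGET undersized?
    SELECT name, value
      FROM v$pgastat
     WHERE name IN ('aggregate PGA target parameter',
                    'total PGA allocated',
                    'over allocation count');

    -- Optimal vs one-pass vs multi-pass workarea executions
    SELECT name, value
      FROM v$sysstat
     WHERE name LIKE 'workarea executions%';
    ```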

    He also researched all the scattered reads by checking the code that
    generated the FTS. It seems you eliminated some of this, but maybe look
    at why so many FTSs are occurring. Is indexing not an option for some
    reason?
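    One rough way to find the offending code (just a sketch; ordering by
    disk reads is a crude proxy for FTS activity):

    ```sql
    -- Top 10 statements by disk reads; heavy FTS usually floats to the top
    SELECT *
      FROM (SELECT disk_reads, executions,
                   SUBSTR(sql_text, 1, 80) AS sql_text
              FROM v$sql
             ORDER BY disk_reads DESC)
     WHERE ROWNUM <= 10;
    ```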

    In Kyte's examples he worked until direct path writes and scattered
    reads were gone completely. Not sure if this is normally achievable or
    just in his example.

    For the rest of your waits there is little mention. Log file syncs may
    be high, but I have no reference to go by as to what is normal. Perhaps
    you are switching logs too often? Log files too small or too few? Just
    a WAG. It could also be log buffer size issues: if the buffer is too
    big or too small you can have problems, though I think bigger is
    generally better up to a point.
    See http://www.ixora.com.au/tips/tuning/log_buffer_size.htm

    Other than this, I know of checking the instance efficiency section to
    make sure your soft parse percentage is high (near 100% after the
    instance has been up for a while).
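    That same ratio can be pulled straight from v$sysstat instead of the
    statspack report, with something like:

    ```sql
    -- Soft parse % = (total parses - hard parses) / total parses
    SELECT ROUND(100 * (t.value - h.value) / t.value, 2) AS soft_parse_pct
      FROM (SELECT value FROM v$sysstat
             WHERE name = 'parse count (total)') t,
           (SELECT value FROM v$sysstat
             WHERE name = 'parse count (hard)') h;
    ```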

    Hope this helps; I'm just tossing out random thoughts. Please post your
    results if you find a solution. My best guess is that it is not a Sun
    issue but rather custom code or db tuning issues, though that is just a
    guess based on past experience.


    "jfixsen" <jfixsen@virtum undo.comwrote in message
    news:26bc4d13.0 401220823.7ecb1 e57@posting.goo gle.com...
    Hello!
    >
    Oracle 9.2.0.4
    SunOS pchi-db01 5.8 Generic_108528-19 sun4u sparc
    SUNW,Ultra-EnterpriseSyste m = SunOS
    Node = pchi-db01
    Release = 5.8
    KernelID = Generic_108528-19
    Machine = sun4u
    BusType = <unknown>
    Serial = <unknown>
    Users = <unknown>
    OEM# = 0
    Origin# = 1
    NumCPU = 8
    >
    History: we had been using a SAN disk array for storage and then
    switched over to a Sun T3. About a week after moving to the T3, I saw
    the following message in my alert log: WARNING: aiowait timed out 1
    times. No performance problems happened until about a month after
    that.
    >
    Problem: We started seeing huge performance problems from out of
    nowhere on December 16 on everything from big batch jobs (heavy FTS)
    to simply logging in, and I started seeing several of these aiowait
    messages each day, sometimes up to 20. No application changes were
    made at any time during any of this, and after the performance
    problems started, I even cut the load, mostly big FTS jobs, way back.
    >
    I had been running statspack everyday the entire time, and on the day
    the performance problems hit hard (about 5 weeks after going to the
    T3), the noticeable difference I saw in the statspack reports were all
    related to writes from what I could tell.
    >
    BEFORE:
    >
    Top 5 Timed Events
    ~~~~~~~~~~~~~~~ ~~~
    % Total
    Event Waits Time (s)
    Ela Time
    -------------------------------------------- ------------ -----------
    --------
    CPU time 43,689,174
    92.51
    db file scattered read 131,668,468 949,948
    2.01
    PX Deq: Execute Reply 931,750 496,692
    1.05
    direct path read 73,177,620 489,356
    1.04
    PX Deq Credit: send blkd 24,148,414 425,685
    .90
    >
    AFTER:
    >
    Top 5 Timed Events
    ~~~~~~~~~~~~~~~ ~~~
    % Total
    Event Waits Time (s)
    Ela Time
    -------------------------------------------- ------------ -----------
    --------
    log file sync 874,418 604,164
    32.92
    direct path write 121,724 233,840
    12.74
    PX Deq Credit: send blkd 1,103,485 212,285
    11.57
    db file scattered read 12,039,568 165,860
    9.04
    log buffer space 158,742 127,009
    6.92
    >
    BEFORE:
    >
    Avg
    Total Wait wait
    Waits
    Event Waits Timeouts Time (s) (ms)
    /txn
    ---------------------------- ------------ ---------- ---------- ------
    --------
    db file scattered read 131,668,468 0 949,948 7
    60.4
    PX Deq: Execute Reply 931,750 211,623 496,692 533
    0.4
    direct path read 73,177,620 0 489,356 7
    33.5
    PX Deq Credit: send blkd 24,148,414 122,149 425,685 18
    11.1
    db file sequential read 31,794,392 0 349,188 11
    14.6
    direct path write 2,652,105 0 309,880 117
    1.2
    log file sync 3,021,375 104,267 201,582 67
    1.4
    db file parallel write 547,546 254,564 68,136 124
    0.3
    enqueue 27,670 14,655 50,246 1816
    0.0
    log buffer space 110,172 32,943 47,180 428
    0.1
    >
    AFTER:
    >
    Avg Total Wait
    wait Waits
    Event Waits Timeouts Time (s) (ms)
    /txn
    ---------------------------- ------------ ---------- ---------- ------
    --------
    log file sync 874,418 534,000 604,164 691
    3.4
    direct path write 121,724 0 233,840 1921
    0.5
    PX Deq Credit: send blkd 1,103,485 94,421 212,285 192
    4.3
    db file scattered read 12,039,568 0 165,860 14
    46.5
    log buffer space 158,742 120,911 127,009 800
    0.6
    PX Deq: Execute Reply 742,346 61,708 126,984 171
    2.9
    >
    >
    I am about to call Sun, but I want to be sure it's not something I'm
    overlooking. We've got all the latest sun patches, and one post I saw
    that had to do with the aiowait message said to change /etc/system to
    include the following parm after installing the patch (which we
    already had):
    >
    * this parm is associated with the aiowait errors that was corrected
    in patch 112255-01 solaris v8
    set TS:ts_sleep_pro mote=1
    >
    I haven't seen the aiowait message in the alert log since the parm had
    been added and the db restarted, but performance still sucks and
    statspack numbers are still bad. Most noticably, when I try logging
    into sqlplus, it hangs from 10-60 seconds. If I look at the wait
    events from another session, it always waits on log buffer space
    initially, and then log file sync.
    >
    To me, this all smells like a SUN problem, but I would love your
    opinions before I call.
    >
    Thanks
    >
    Jason
    jfixsen@nospam_ virtumundo.com
