首页 > 代码库 > direct path write temp引起的数据库重启以及ogg进程停止

direct path write temp引起的数据库重启以及ogg进程停止

集中客户系统在2016年11月8日17点业务反馈系统链接不上,是不是有问题。当即查看系统

[oracle@SIDDB01 hwbackup3]$crs_stat -t -v                               

Name           Type           R/RA   F/FT   Target    State     Host    

----------------------------------------------------------------------  

ora....D1.inst application    1/5    0/0    ONLINE    ONLINE    siddb01 

ora....D2.inst application    0/5    0/0    ONLINE    ONLINE    esbdb01 

ora....ID2.srv application    0/0    0/0    ONLINE    ONLINE    esbdb01 

ora....Bsrv.cs application    0/0    0/1    ONLINE    ONLINE    esbdb01 

ora....ID1.srv application    0/0    0/0    ONLINE    ONLINE    siddb01 

ora....Dsrv.cs application    0/0    0/1    ONLINE    ONLINE    siddb01 

ora.ESBSID.db  application    0/0    0/1    ONLINE    ONLINE    siddb01 

ora....01.lsnr application    0/5    0/0    ONLINE    ONLINE    esbdb01 

ora....b01.gsd application    2/5    0/0    ONLINE    OFFLINE           

ora....b01.ons application    0/3    0/0    ONLINE    ONLINE    esbdb01 

ora....b01.vip application    0/0    0/0    ONLINE    ONLINE    esbdb01 

ora....01.lsnr application    0/5    0/0    ONLINE    ONLINE    siddb01 

ora....b01.gsd application    5/5    0/0    ONLINE    OFFLINE           

ora....b01.ons application    1/3    0/0    ONLINE    ONLINE    siddb01 

ora....b01.vip application    0/0    0/0    ONLINE    ONLINE    siddb01 


因为监控5分钟自动采集还没来得及嗅探,数据库自启动完成,所以运维人员都不知道数据库发生了什么事情。查看一下CRS日志情况,看看是否有切换动作;

                                                                                                                                                            

2016-11-02 16:26:21.226: [  CRSAPP][2990760] CheckResource error for ora.siddb01.vip error code = -1                                                        

2016-11-08 17:08:15.846: [  CRSEVT][3027957] CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)                                          

category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child                                        

                                                                                                                                                            

2016-11-08 17:08:25.193: [  CRSEVT][3027957] CAAMonitorHandler :: 0:Action Script /oracle/crs/bin/racgwrap(check) timed out for ora.siddb01.vip! (timeout=60

)                                                                                                                                                           

2016-11-08 17:08:25.194: [  CRSAPP][3027957] CheckResource error for ora.siddb01.vip error code = -2                                                        

2016-11-08 17:09:53.467: [  CRSEVT][3027958] CAAMonitorHandler :: 0:Could not join /oracle/product/10.2.0/bin/racgwrap(check)                               

category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child                                        

                                                                                                                                                            

2016-11-08 17:10:07.582: [  CRSAPP][3027958] CheckResource error for ora.siddb01.LISTENER_SIDDB01.lsnr error code = -1                                      

2016-11-08 17:10:14.102: [  CRSEVT][3027962] CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)                                          

category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child                                        

                                                                                                                                                            

2016-11-08 17:10:19.909: [  CRSAPP][3027962] CheckResource error for ora.ESBSID.db error code = -1                                                          

2016-11-08 17:10:39.084: [  CRSRES][3027965] In stateChanged, ora.ESBSID.ESBSID1.inst target is ONLINE                                                      

2016-11-08 17:10:41.191: [  CRSRES][3027965] ora.ESBSID.ESBSID1.inst on siddb01 went OFFLINE unexpectedly                                                   

2016-11-08 17:10:43.326: [  CRSRES][3027965] StopResource: setting CLI values                                                                               

2016-11-08 17:10:43.723: [  CRSRES][3027965] Attempting to stop `ora.ESBSID.ESBSID1.inst` on member `siddb01`                                               

2016-11-08 17:11:09.197: [  CRSRES][3027977] In stateChanged, ora.ESBSID.SIDsrv.ESBSID1.srv target is ONLINE                                                

2016-11-08 17:11:09.198: [  CRSRES][3027977] ora.ESBSID.SIDsrv.ESBSID1.srv on siddb01 went OFFLINE unexpectedly                                             

2016-11-08 17:11:09.198: [  CRSRES][3027977] StopResource: setting CLI values                                                                               

2016-11-08 17:11:09.885: [  CRSRES][3027977] Attempting to stop `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `siddb01`                                         

2016-11-08 17:11:10.543: [  CRSRES][3027977] Stop of `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `siddb01` succeeded.                                         

2016-11-08 17:11:10.614: [  CRSRES][3027977] ora.ESBSID.SIDsrv.ESBSID1.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0                                               

2016-11-08 17:11:10.730: [  CRSRES][3027977] ora.ESBSID.SIDsrv.ESBSID1.srv failed on siddb01 relocating.                                                    

2016-11-08 17:11:13.721: [  CRSRES][3027977] Attempting to start `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `esbdb01`                                        

2016-11-08 17:11:26.264: [  CRSAPP][3027988] CheckResource error for ora.ESBSID.SIDsrv.cs error code = 1                                                    

2016-11-08 17:11:26.267: [  CRSRES][3027988] In stateChanged, ora.ESBSID.SIDsrv.cs target is ONLINE                                                         

2016-11-08 17:11:26.267: [  CRSRES][3027988] ora.ESBSID.SIDsrv.cs on siddb01 went OFFLINE unexpectedly                                                      

2016-11-08 17:11:26.268: [  CRSRES][3027988] StopResource: setting CLI values                                                                               

2016-11-08 17:11:26.300: [  CRSRES][3027988] Attempting to stop `ora.ESBSID.SIDsrv.cs` on member `siddb01`                                                  

2016-11-08 17:11:26.564: [  CRSRES][3027988] Stop of `ora.ESBSID.SIDsrv.cs` on member `siddb01` succeeded.                                                  

2016-11-08 17:11:26.564: [  CRSRES][3027988] ora.ESBSID.SIDsrv.cs RESTART_COUNT=0 RESTART_ATTEMPTS=0                                                        

2016-11-08 17:11:27.639: [  CRSRES][3027988] ora.ESBSID.SIDsrv.cs exceeded it‘s failure threshold.  Stopping it and its dependents!                         

2016-11-08 17:11:27.656: [  CRSRES][3027988] ora.ESBSID.SIDsrv.cs target set to OFFLINE before stop action                                                  

2016-11-08 17:11:27.656: [  CRSRES][3027988] StopResource: setting CLI values                                                                               

2016-11-08 17:11:27.665: [  CRSRES][3027988] Target set to OFFLINE for `ora.ESBSID.SIDsrv.cs`                                                               

2016-11-08 17:11:28.419: [  CRSRES][3027965] Stop of `ora.ESBSID.ESBSID1.inst` on member `siddb01` succeeded.                                               

2016-11-08 17:11:28.420: [  CRSRES][3027965] ora.ESBSID.ESBSID1.inst RESTART_COUNT=1 RESTART_ATTEMPTS=5                                                     

2016-11-08 17:11:28.420: [  CRSRES][3027965] ora.ESBSID.ESBSID1.inst Uptime exceeds uptime_threshold, resetting RC                                          

2016-11-08 17:11:28.420: [  CRSRES][3027965] Restarting ora.ESBSID.ESBSID1.inst on siddb01                                                                  

2016-11-08 17:11:28.426: [  CRSRES][3027965] startRunnable: setting CLI values                                                                              

2016-11-08 17:11:28.426: [  CRSRES][3027965] Attempting to start `ora.ESBSID.ESBSID1.inst` on member `siddb01`                                              

2016-11-08 17:11:34.766: [  CRSRES][3027977] Start of `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `esbdb01` succeeded.                                        

2016-11-08 17:11:46.009: [  CRSRES][3027999] startRunnable: setting CLI values                                                                              

2016-11-08 17:12:22.967: [  CRSRES][3027965] Start of `ora.ESBSID.ESBSID1.inst` on member `siddb01` succeeded.                                              

2016-11-08 17:12:22.968: [  CRSRES][3027965] Successfully restarted ora.ESBSID.ESBSID1.inst on siddb01, RESTART_COUNT=1                                     

2016-11-08 17:12:22.997: [  CRSRES][3027965] ora.ESBSID.ESBSID1.inst Updated LAST_RESTART time in ocr               


打算从会话连接数入手,找到连接数超高时间点,然后以此来查询当时数据库等待事件,然后找到对应的SQL进行分析;


INSTANCE_NUMBER TO_CHAR(TRUNC(SAMPLE_TIME,‘MI‘),‘YYYY-MM-DDHH24:MI‘)                          COUNT(*)
--------------- --------------------------------------------------------------------------- ----------
              1 2016-11-08 16:55                                                                   270
              1 2016-11-08 16:56                                                                   951
              1 2016-11-08 16:57                                                                  1917
              1 2016-11-08 16:58                                                                  2000
              1 2016-11-08 16:59                                                                  1152
              1 2016-11-08 17:12                                                                   379
              1 2016-11-08 17:13                                                                   560
              1 2016-11-08 17:14                                                                   531
              1 2016-11-08 17:15                                                                   458
              1 2016-11-08 17:16                                                                   125
              1 2016-11-08 17:17                                                                   106

查看一下当时时间点的等待事件

SQL> SELECT event_id, event, COUNT (*) cnt                                               
  2      FROM dba_hist_active_sess_history                                               
  3     WHERE snap_id  between 64368 and 64369 AND wait_class_id = 1740759767            
  4  GROUP BY event_id, event   ORDER BY 3 desc;                                         
                                                                                         
  EVENT_ID EVENT                                                                   CNT   
---------- ---------------------------------------------------------------- ----------   
  38438084 direct path write temp                                                 2867   
3926164927 direct path read                                                       2326   
2652584166 db file sequential read                                                1710   
 861319509 direct path read temp                                                   616   
3056446529 read by other session                                                   604   
 506183215 db file scattered read                                                  572   
 834992820 db file parallel read                                                    29   
 885859547 direct path write                                                        18   
SQL> SELECT sql_id, COUNT (*) cnt                                                        
  2    FROM dba_hist_active_sess_history                                                 
  3   WHERE snap_id  between 64368 and 64369                                             
  4         AND event_id IN(38438084)GROUP BY sql_id                                     
  5  HAVING COUNT (*) > 1                                                                
  6  ORDER BY 2 desc;                                                                    
                                                                                         
SQL_ID               CNT                                                                 
------------- ----------                                                                 
a9k9pzsnh5s9n        575                                                                 
gjv7gx73q07c6        258                                                                 
7zbvg66aw143a        227                                                                 
01spkazwnt63u        206                                                                 
dt4xb2ausdy1z        187                                                                 
768krsak8tg2d        185                                                                 
4mzyj9vbyypc8        148                                                                 
d1j0r7w8hwrp9        138                                                                 
du52p4c2735y2        123                                                                 
1wczkxv3znt0h        107                                                                 
b8sj2zpc6n4kh        103                                                                 
4q08ghwrssm09         99                                                                 
ggurqxjuqpca0         84                                                                 
67ram5177zdgf         75                                                               

SQL> SELECT sql_id, COUNT (*) cnt                                                        
  2    FROM dba_hist_active_sess_history                                                 
  3   WHERE snap_id  between 64368 and 64369                                             
  4         AND event_id IN(3926164927)GROUP BY sql_id                                   
  5  HAVING COUNT (*) > 1                                                                
  6  ORDER BY 2 desc;                                                                    
                                                                                         
SQL_ID               CNT                                                                 
------------- ----------                                                                 
b8sj2zpc6n4kh        638                                                                 
768krsak8tg2d        607                                                                 
67ram5177zdgf        315                                                                 
6varm0b41sfxd        164                                                                 
4q08ghwrssm09        109                                                                 
01spkazwnt63u         64                                                                 
g9x112cs2dr4h         64            

   1,大量的磁盘排序操作,order by, group by, union, distinct,无法在PGA中完成排序,需要利用temp表空间进行排序,当从临时表空间中读取排序结果时,会产生direct path read.
   2. 大量的Hash Join操作,利用temp表空间保存hash区。
   3. SQL语句的并行处理
   4. 大表的全表扫描

技术分享

上面列举的SQL全都是通过SQL*PLUS连接进来执行的语句也都是distinct\parallel 20

技术分享

系统内当时也有很多的direct path read、db file sequential read等待事件;

在排序操作(order by/group by/union/distinct/rollup/合并连接)时,由于PGA中的SORT_AREA_SIZE空间不足,造成需要使用临时表空间来保存中间结果,当从临时表空间读入排序结果时,产生direct path read等待事件。

使用HASH连接的SQL语句,将不适合位于内存中的散列分区刷新到临时表空间中。为了查明匹配SQL谓词的行,临时表空间中的散列分区被读回到内存中(目的是为了查明匹配SQL谓词的行),ORALCE会话在direct path read等待事件上等待。

使用并行扫描的SQL语句也会影响系统范围的direct path read等待事件。在并行执行过程中,direct path read等待事件与从属查询有关,而与父查询无关,运行父查询的会话基本上会在PX Deq:Execute Reply上等待,从属查询会产生direct path read等待事件我们来验证一下这些语句的执行计划;

技术分享

那么有朋友想问了,ogg重启和他有什么关系?这是我们这个系统专门写的脚本,进行判断,为了就是怕将归档日志给删除,倒是ogg进程异常停止,没想到那段时间的归档日志已经被脚本转移到其他目录之下,这样ogg的抽取进程就报了如下错误;

2016-11-08 19:10:15  ERROR   OGG-00446  Could not find archived log for sequence 144299 thread 1 under default destinations SQL <SELECT  name    FROM v$arch

ived_log   WHERE sequence# = :ora_seq_no AND         thread# = :ora_thread AND         resetlogs_id = :ora_resetlog_id AND         archived = ‘YES‘ AND

    deleted = ‘NO‘ AND         name not like ‘+%‘         AND standby_dest = ‘NO‘ >, error retrieving redo file name for sequence 144299, archived = 1, use_

alternate = 0Not able to establish initial position for sequence 144299, rba 243031568.

OGG-00446也指明了Could not find archived log for sequence 144299 thread 1,我们将故障点以后的ogg进行移回归档目录操作。

技术分享

因为脚本是用root跑的,权限也都有所改变。再将权限重新赋给oracle用户和它的属组

 root@SIDDB01 # cd /oracle/oradata/archlog/sidarchlog/
root@SIDDB01 # ls -ltar
total 8177360
drwxr-xr-x   6 oracle   dba         8192 Jan  6  2015 ..
-r--r-----   1 oracle   dba      759247872 Jul  7  2015 2_152214_784815576.dbf
-r--r-----   1 oracle   dba         1024 Jul  7  2015 2_152215_784815576.dbf
-r--r-----   1 oracle   dba         1024 Jul  7  2015 2_152216_784815576.dbf
-r--r-----   1 oracle   dba         1024 Jul  7  2015 2_152217_784815576.dbf
-r--r-----   1 oracle   dba      311396864 Nov  8 17:20 1_144306_784815576.dbf
-r--r-----   1 oracle   dba      716086272 Nov  8 17:34 1_144307_784815576.dbf
-r--r-----   1 oracle   dba      187094016 Nov  8 17:49 1_144308_784815576.dbf
-r--r-----   1 oracle   dba      160567296 Nov  8 18:01 1_144309_784815576.dbf
-r--r-----   1 oracle   dba      283110912 Nov  8 18:21 1_144310_784815576.dbf
-r--r-----   1 oracle   dba      316848640 Nov  8 18:47 1_144311_784815576.dbf
-r--r-----   1 oracle   dba      194941952 Nov  8 19:03 1_144312_784815576.dbf
-r--r-----   1 root     root     279188480 Nov  8 19:31 1_144299_784815576.dbf
-r--r-----   1 root     root     242944512 Nov  8 19:31 1_144300_784815576.dbf
drwxr-xr-x   2 oracle   dba        16384 Nov  8 19:39 .
-rw-r--r--   1 oracle   dba          391 Nov  8 19:40 ggserr.log
-r--r-----   1 root     root     167494144 Nov  8 19:45 1_144301_784815576.dbf
-r--r-----   1 root     root     199065088 Nov  8 19:46 1_144302_784815576.dbf
-r--r-----   1 root     root     271761408 Nov  8 19:47 1_144303_784815576.dbf
-r--r-----   1 root     root     96945664 Nov  8 19:47 1_144304_784815576.dbf
-r--r-----   1 root     root        1024 Nov  8 19:47 1_144305_784815576.dbf
root@SIDDB01 # chown oracle:dba 1_144*
root@SIDDB01 # ls -ltar
total 8177360
drwxr-xr-x   6 oracle   dba         8192 Jan  6  2015 ..
-r--r-----   1 oracle   dba      759247872 Jul  7  2015 2_152214_784815576.dbf
-r--r-----   1 oracle   dba         1024 Jul  7  2015 2_152215_784815576.dbf
-r--r-----   1 oracle   dba         1024 Jul  7  2015 2_152216_784815576.dbf
-r--r-----   1 oracle   dba         1024 Jul  7  2015 2_152217_784815576.dbf
-r--r-----   1 oracle   dba      311396864 Nov  8 17:20 1_144306_784815576.dbf
-r--r-----   1 oracle   dba      716086272 Nov  8 17:34 1_144307_784815576.dbf
-r--r-----   1 oracle   dba      187094016 Nov  8 17:49 1_144308_784815576.dbf
-r--r-----   1 oracle   dba      160567296 Nov  8 18:01 1_144309_784815576.dbf
-r--r-----   1 oracle   dba      283110912 Nov  8 18:21 1_144310_784815576.dbf
-r--r-----   1 oracle   dba      316848640 Nov  8 18:47 1_144311_784815576.dbf
-r--r-----   1 oracle   dba      194941952 Nov  8 19:03 1_144312_784815576.dbf
-r--r-----   1 oracle   dba      279188480 Nov  8 19:31 1_144299_784815576.dbf
-r--r-----   1 oracle   dba      242944512 Nov  8 19:31 1_144300_784815576.dbf
drwxr-xr-x   2 oracle   dba        16384 Nov  8 19:39 .
-rw-r--r--   1 oracle   dba          391 Nov  8 19:40 ggserr.log
-r--r-----   1 oracle   dba      167494144 Nov  8 19:45 1_144301_784815576.dbf
-r--r-----   1 oracle   dba      199065088 Nov  8 19:46 1_144302_784815576.dbf
-r--r-----   1 oracle   dba      271761408 Nov  8 19:47 1_144303_784815576.dbf
-r--r-----   1 oracle   dba      96945664 Nov  8 19:47 1_144304_784815576.dbf
-r--r-----   1 oracle   dba         1024 Nov  8 19:47 1_144305_784815576.dbf

然后我们在重新启动extract进程

 GGSCI (SIDDB01) 17> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
JAGENT      STOPPED                                           
EXTRACT     RUNNING     DP01        00:00:00      00:00:03    
EXTRACT     RUNNING     DP02        01:10:28      00:00:05    
EXTRACT     RUNNING     DP03        00:00:00      00:00:10    
EXTRACT     RUNNING     DP04        01:11:41      00:00:06    
EXTRACT     RUNNING     EXT1        00:00:01      00:00:08    
EXTRACT     RUNNING     EXT2        01:10:37      00:00:07    
EXTRACT     RUNNING     EXT3        00:00:01      00:00:09    
EXTRACT     RUNNING     EXT4        01:11:31      00:00:02    

进程启动正常,view report ext4也没有报错。剩下的交给时间慢慢抽取,感谢大家。

本文出自 “杨俊峰的博文” 博客,谢绝转载!

direct path write temp引起的数据库重启以及ogg进程停止