首页 > 代码库 > 11G RAC 11.2.0.1.0实例evict故障处理

11G RAC 11.2.0.1.0实例evict故障处理

Aix 7.1

参考文档:

https://blogs.oracle.com/database4cn/rac

Resolving ORA-481 and "terminating the instance due to error 481" (Doc ID 1950963.1)

ORA-00481 After "The instance eviction reason is 0x2" due to Lack of Ticket (Doc ID 1644015.1)

归根结底是RAC节点之间消息传输流量控制的问题,除网络、硬件原因外,需要打补丁解决。

以下为第一次处理前具体的alertlog。

◆node1
Wed May 24 01:59:24 2017
Remote instance 2 kill is issued with system inc 228
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x2
The instance eviction map is 2
Reconfiguration started (old inc 228, new inc 230)
List of instances:
1 (myinst: 1)
Wed May 24 01:59:27 2017
Trace dumping is performing id=[cdmp_20170524015904]
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed May 24 01:59:28 2017
LMS 1: 14 GCS shadows cancelled, 1 closed, 0 Xw survived
Wed May 24 01:59:28 2017
LMS 0: 23 GCS shadows cancelled, 0 closed, 0 Xw survived
Wed May 24 01:59:28 2017
LMS 2: 21 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Wed May 24 01:59:32 2017
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Wed May 24 01:59:48 2017
parallel recovery started with 32 processes
Started redo scan
Wed May 24 01:59:49 2017
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Wed May 24 02:00:00 2017
Completed redo scan
read 2094379 KB redo, 237345 data blocks need recovery
Wed May 24 02:00:03 2017
Reconfiguration started (old inc 230, new inc 232)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed May 24 02:00:04 2017
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Wed May 24 02:00:04 2017
Wed May 24 02:00:04 2017
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Wed May 24 02:02:27 2017
,
,此处大量12170
,
Wed May 24 02:05:34 2017
LMON (ospid: 35979900): terminating the instance due to error 481
Wed May 24 02:05:34 2017
System state dump is made for local instance
System State dumped to trace file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_40894656.trc
Instance terminated by LMON, pid = 35979900
Wed May 24 08:12:27 2017
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en1 192.168.0.0 configured from GPnP Profile for use as a cluster interconnect
Interface type 1 en0 10.209.199.0 configured from GPnP Profile for use as a public interface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
Using parameter settings in server-side pfile /u01/app/11.2.0/oracle/product/11.2.0/db_1/dbs/initorcl1.ora
System parameters with non-default values:
processes = 2000
timed_statistics = TRUE
sga_max_size = 70G
spfile = "+DATA1/orcl/spfileorcl.ora"
sga_target = 50G
control_files = "+DATA1/orcl/controlfile/current.261.832429113"
control_files = "+ARCDG/orcl/controlfile/current.259.832429113"
db_block_size = 8192
compatible = "11.2.0.0.0"
log_archive_format = "%t_%s_%r.dbf"
cluster_database = TRUE
db_create_file_dest = "+DATA1"
db_recovery_file_dest = "+ARCDG"
db_recovery_file_dest_size= 2048G
thread = 1
undo_tablespace = "UNDOTBS1"
undo_retention = 10800
instance_number = 1
remote_login_passwordfile= "NONE"
db_domain = ""
service_names = "orcl"
dispatchers = "(PROTOCOL=TCP) (SERVICE=orclXDB)"
local_listener = "(ADDRESS = (PROTOCOL = TCP)(HOST = 10.209.199.4)(PORT = 1521))"
remote_listener = "rac-scan:1521"
result_cache_max_size = 268736K
audit_file_dest = "/u01/app/11.2.0/oracle/admin/orcl/adump"
audit_trail = "DB"
db_name = "orcl"
open_cursors = 1000
sql_trace = FALSE
optimizer_index_caching = 90
pga_aggregate_target = 20G
deferred_segment_creation= FALSE
aq_tm_processes = 5
diagnostic_dest = "/u01/app/11.2.0/oracle"
Deprecated system parameters with specified values:
sql_trace
End of deprecated system parameter listing
Cluster communication is configured to use the following interface(s) for this instance
192.168.0.1
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Wed May 24 08:12:32 2017
PMON started with pid=2, OS id=35652038
Wed May 24 08:12:32 2017
VKTM started with pid=3, OS id=33489120 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
Wed May 24 08:12:32 2017
GEN0 started with pid=4, OS id=34209898
Wed May 24 08:12:32 2017
DIAG started with pid=5, OS id=36438538
Wed May 24 08:12:33 2017
DBRM started with pid=6, OS id=24576446
Wed May 24 08:12:33 2017
PING started with pid=7, OS id=29687920
Wed May 24 08:12:33 2017
PSP0 started with pid=8, OS id=36635296
Wed May 24 08:12:33 2017
ACMS started with pid=9, OS id=34341462
Wed May 24 08:12:33 2017
DIA0 started with pid=10, OS id=32047870
Wed May 24 08:12:33 2017
LMON started with pid=11, OS id=36045304
Wed May 24 08:12:35 2017
LMD0 started with pid=12, OS id=31391860
Wed May 24 08:12:35 2017
LMS0 started with pid=13, OS id=35979918 at elevated priority
Wed May 24 08:12:35 2017
LMS1 started with pid=14, OS id=22741294 at elevated priority
Wed May 24 08:12:36 2017
LMS2 started with pid=15, OS id=29229354 at elevated priority
Wed May 24 08:12:36 2017
RMS0 started with pid=16, OS id=20185780
Wed May 24 08:12:36 2017
LMHB started with pid=17, OS id=29425998
Wed May 24 08:12:36 2017
MMAN started with pid=18, OS id=9699954
Wed May 24 08:12:36 2017
DBW0 started with pid=19, OS id=31850686
Wed May 24 08:12:36 2017
DBW1 started with pid=20, OS id=36045442
Wed May 24 08:12:36 2017
DBW2 started with pid=21, OS id=40894536
Wed May 24 08:12:36 2017
DBW3 started with pid=22, OS id=29819564
Wed May 24 08:12:36 2017
DBW4 started with pid=23, OS id=36634890
Wed May 24 08:12:36 2017
LGWR started with pid=24, OS id=35783370
Wed May 24 08:12:37 2017
CKPT started with pid=25, OS id=35914258
Wed May 24 08:12:37 2017
SMON started with pid=26, OS id=32374796
Wed May 24 08:12:37 2017
RECO started with pid=27, OS id=35652280
Wed May 24 08:12:37 2017
RBAL started with pid=28, OS id=36438322
Wed May 24 08:12:37 2017
ASMB started with pid=29, OS id=39387144
Wed May 24 08:12:37 2017
MMON started with pid=30, OS id=2621900
NOTE: initiating MARK startup
Wed May 24 08:12:37 2017
Starting background process MARKMMNL started with pid=31, OS id=35979732

starting up 1 dispatcher(s) for network address (ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))...
Wed May 24 08:12:37 2017
MARK started with pid=32, OS id=35717688
NOTE: MARK has subscribed
starting up 1 shared server(s) ...
lmon registered with NM - instance number 1 (internal mem no 0)
Reconfiguration started (old inc 0, new inc 236)
List of instances:
1 2 (myinst: 1)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* domain 0 valid according to instance 2
* domain 0 valid = 1 according to instance 2
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Wed May 24 08:12:45 2017
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Wed May 24 08:12:52 2017
LCK0 started with pid=34, OS id=24379778
Wed May 24 08:12:52 2017
Starting background process RSMN
Wed May 24 08:12:52 2017
RSMN started with pid=36, OS id=30802026
ORACLE_BASE from environment = /u01/app/11.2.0/oracle
Wed May 24 08:12:53 2017
ALTER DATABASE MOUNT
Wed May 24 08:12:53 2017
NOTE: Loaded library: System
Wed May 24 08:12:53 2017
SUCCESS: diskgroup DATA1 was mounted
SUCCESS: diskgroup ARCDG was mounted
Wed May 24 08:12:53 2017
NOTE: dependency between database orcl and diskgroup resource ora.DATA1.dg is established
NOTE: dependency between database orcl and diskgroup resource ora.ARCDG.dg is established
Wed May 24 08:12:57 2017
Successful mount of redo thread 1, with mount id 1472569957
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
Wed May 24 08:12:58 2017
ALTER DATABASE OPEN
Block change tracking file is current.
Picked broadcast on commit scheme to generate SCNs
Wed May 24 08:12:58 2017
SUCCESS: diskgroup DATA2 was mounted
NOTE: dependency between database orcl and diskgroup resource ora.DATA2.dg is established
SUCCESS: diskgroup DATA3 was mounted
NOTE: dependency between database orcl and diskgroup resource ora.DATA3.dg is established
Thread 1 advanced to log sequence 255944 (thread open)
Thread 1 opened at log sequence 255944
Current log# 2 seq# 255944 mem# 0: +ARCDG/orcl/onlinelog/group_2.261.840376035
Current log# 2 seq# 255944 mem# 1: +ARCDG/orcl/onlinelog/group_2.2113.840376039
Current log# 2 seq# 255944 mem# 2: +ARCDG/orcl/onlinelog/group_2.2114.840376041
Current log# 2 seq# 255944 mem# 3: +ARCDG/orcl/onlinelog/group_2.2116.840376043
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Starting background process CTWR
Wed May 24 08:12:59 2017
CTWR started with pid=39, OS id=32768642
Block change tracking service is active.
Wed May 24 08:13:00 2017
SMON: enabling cache recovery
Successfully onlined Undo Tablespace 2.
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Wed May 24 08:13:07 2017
GTX0 started with pid=42, OS id=35062526
Starting background process RCBG
Wed May 24 08:13:07 2017
RCBG started with pid=43, OS id=27721944
replication_dependency_tracking turned off (no async multimaster replication found)
Wed May 24 08:13:08 2017
Starting background process QMNC
Wed May 24 08:13:08 2017
QMNC started with pid=44, OS id=34013894
Completed: ALTER DATABASE OPEN
Wed May 24 08:13:17 2017
Starting background process CJQ0
Wed May 24 08:13:17 2017
CJQ0 started with pid=41, OS id=11469004
Wed May 24 08:13:43 2017
Starting background process SMCO
Wed May 24 08:13:43 2017
SMCO started with pid=120, OS id=42336480
,
,
,
Wed May 24 09:08:16 2017
Errors in file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_j001_33161436.trc:
ORA-12012: 鑎仟錨狮鎊骚琛^啼絕茕竈239 錨呛閊証
ORA-00001: 杩^蒎^蟐湾^辕涓^犁害鎊較哝^荨浠(CM.UK_LTE_ZY_CHECK_HISTORY)
ORA-06512: 錨塄 "CM.JOB_LTE_ZY_CHECK", line 27
ORA-06512: 錨塄 line 1

◆node2
Wed May 24 01:59:01 2017
LMS1 (ospid: 26411086) received an instance eviction notification from instance 1 [2]
Wed May 24 01:59:01 2017
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x2
The instance eviction map is 2
Wed May 24 01:59:04 2017
PMON (ospid: 50725354): terminating the instance due to error 481
Wed May 24 01:59:04 2017
System state dump is made for local instance
System State dumped to trace file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_42336686.trc
Wed May 24 01:59:04 2017
ORA-1092 : opitsk aborting process
Wed May 24 01:59:04 2017
License high water mark = 516
Instance terminated by PMON, pid = 50725354
USER (ospid: 50987430): terminating the instance
Instance terminated by USER, pid = 50987430
Wed May 24 01:59:29 2017
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en1 192.168.0.0 configured from GPnP Profile for use as a cluster interconnect
Interface type 1 en0 10.209.199.0 configured from GPnP Profile for use as a public interface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
Using parameter settings in server-side pfile /u01/app/11.2.0/oracle/product/11.2.0/db_1/dbs/initorcl2.ora
System parameters with non-default values:
processes = 2000
timed_statistics = TRUE
sga_max_size = 70G
spfile = "+DATA1/orcl/spfileorcl.ora"
sga_target = 50G
control_files = "+DATA1/orcl/controlfile/current.261.832429113"
control_files = "+ARCDG/orcl/controlfile/current.259.832429113"
db_block_size = 8192
compatible = "11.2.0.0.0"
log_archive_format = "%t_%s_%r.dbf"
cluster_database = TRUE
db_create_file_dest = "+DATA1"
db_recovery_file_dest = "+ARCDG"
db_recovery_file_dest_size= 2048G
thread = 2
undo_tablespace = "UNDOTBS2"
undo_retention = 10800
instance_number = 2
remote_login_passwordfile= "NONE"
db_domain = ""
dispatchers = "(PROTOCOL=TCP) (SERVICE=orclXDB)"
local_listener = "(ADDRESS = (PROTOCOL = TCP)(HOST = 10.209.199.5)(PORT = 1521))"
remote_listener = "rac-scan:1521"
result_cache_max_size = 268736K
audit_file_dest = "/u01/app/11.2.0/oracle/admin/orcl/adump"
audit_trail = "DB"
db_name = "orcl"
open_cursors = 1000
sql_trace = FALSE
optimizer_index_caching = 90
pga_aggregate_target = 20G
deferred_segment_creation= FALSE
aq_tm_processes = 5
diagnostic_dest = "/u01/app/11.2.0/oracle"
Deprecated system parameters with specified values:
sql_trace
End of deprecated system parameter listing
Cluster communication is configured to use the following interface(s) for this instance
192.168.0.2
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Wed May 24 01:59:34 2017
PMON started with pid=2, OS id=66584656
Wed May 24 01:59:34 2017
VKTM started with pid=3, OS id=66846744 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
Wed May 24 01:59:34 2017
GEN0 started with pid=4, OS id=26608090
Wed May 24 01:59:34 2017
DIAG started with pid=5, OS id=26083716
Wed May 24 01:59:34 2017
DBRM started with pid=6, OS id=24510972
Wed May 24 01:59:34 2017
PING started with pid=7, OS id=65077306
Wed May 24 01:59:34 2017
PSP0 started with pid=8, OS id=66781402
Wed May 24 01:59:34 2017
ACMS started with pid=9, OS id=66978040
Wed May 24 01:59:34 2017
DIA0 started with pid=10, OS id=66519050
Wed May 24 01:59:34 2017
LMON started with pid=11, OS id=66453694
Wed May 24 01:59:37 2017
LMD0 started with pid=12, OS id=23658758
Wed May 24 01:59:37 2017
LMS0 started with pid=13, OS id=66322614 at elevated priority
Wed May 24 01:59:37 2017
LMS1 started with pid=14, OS id=65798210 at elevated priority
Wed May 24 01:59:37 2017
LMS2 started with pid=15, OS id=15204812 at elevated priority
Wed May 24 01:59:38 2017
RMS0 started with pid=16, OS id=65732820
Wed May 24 01:59:38 2017
LMHB started with pid=17, OS id=65339438
Wed May 24 01:59:38 2017
MMAN started with pid=18, OS id=52298036
Wed May 24 01:59:38 2017
DBW0 started with pid=19, OS id=52232458
Wed May 24 01:59:38 2017
DBW1 started with pid=20, OS id=65273872
Wed May 24 01:59:38 2017
DBW2 started with pid=21, OS id=65208354
Wed May 24 01:59:38 2017
DBW3 started with pid=22, OS id=65994980
Wed May 24 01:59:38 2017
DBW4 started with pid=23, OS id=61210752
Wed May 24 01:59:38 2017
LGWR started with pid=24, OS id=6095298
Wed May 24 01:59:38 2017
CKPT started with pid=25, OS id=65470648
Wed May 24 01:59:38 2017
SMON started with pid=26, OS id=64946374
Wed May 24 01:59:38 2017
RECO started with pid=27, OS id=51642854
Wed May 24 01:59:39 2017
RBAL started with pid=28, OS id=64618728
Wed May 24 01:59:39 2017
ASMB started with pid=29, OS id=64553208
Wed May 24 01:59:39 2017
MMON started with pid=30, OS id=64422054
NOTE: initiating MARK startup
Wed May 24 01:59:39 2017
MMNL started with pid=31, OS id=42336716
Starting background process MARK
starting up 1 dispatcher(s) for network address (ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))...
Wed May 24 01:59:39 2017
MARK started with pid=32, OS id=51577168
NOTE: MARK has subscribed
Wed May 24 01:59:39 2017
starting up 1 shared server(s) ...
lmon registered with NM - instance number 2 (internal mem no 1)
Reconfiguration started (old inc 0, new inc 232)
List of instances:
1 2 (myinst: 2)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* domain 0 valid = 0 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Wed May 24 01:59:48 2017
Submitted all GCS remote-cache requests
Wed May 24 02:05:10 2017
Trace dumping is performing id=[cdmp_20170524020534]
Wed May 24 02:05:18 2017
Reconfiguration started (old inc 232, new inc 234)
List of instances:
2 (myinst: 2)
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed May 24 02:05:19 2017
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Wed May 24 02:05:19 2017
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Wed May 24 02:05:19 2017
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Wed May 24 02:05:25 2017
LCK0 started with pid=34, OS id=51118396
Wed May 24 02:05:25 2017
Starting background process RSMN
Wed May 24 02:05:25 2017
RSMN started with pid=36, OS id=51773786
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Reusing ORACLE_BASE from an earlier startup = /u01/app/11.2.0/oracle
Wed May 24 02:05:25 2017
ALTER DATABASE MOUNT
This instance was first to mount
Wed May 24 02:05:26 2017
NOTE: Loaded library: System
Wed May 24 02:05:26 2017
SUCCESS: diskgroup DATA1 was mounted
Wed May 24 02:05:26 2017
NOTE: dependency between database orcl and diskgroup resource ora.DATA1.dg is established
SUCCESS: diskgroup ARCDG was mounted
NOTE: dependency between database orcl and diskgroup resource ora.ARCDG.dg is established
Wed May 24 02:05:30 2017
Successful mount of redo thread 2, with mount id 1472569957
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
ALTER DATABASE OPEN
This instance was first to open
Wed May 24 02:05:30 2017
SUCCESS: diskgroup DATA2 was mounted
NOTE: dependency between database orcl and diskgroup resource ora.DATA2.dg is established
SUCCESS: diskgroup DATA3 was mounted
NOTE: dependency between database orcl and diskgroup resource ora.DATA3.dg is established
Block change tracking file is current.
Beginning crash recovery of 2 threads
parallel recovery started with 32 processes
Started redo scan
Wed May 24 02:05:44 2017
Completed redo scan
read 2158203 KB redo, 230150 data blocks need recovery
Started redo application at
Thread 1: logseq 255942, block 163388
Thread 2: logseq 236286, block 2067920
Recovery of Online Redo Log: Thread 1 Group 3 Seq 255942 Reading mem 0
Mem# 0: +ARCDG/orcl/onlinelog/group_3.2085.840375427
Mem# 1: +ARCDG/orcl/onlinelog/group_3.2084.840375445
Mem# 2: +ARCDG/orcl/onlinelog/group_3.263.840375447
Mem# 3: +ARCDG/orcl/onlinelog/group_3.2089.840375449
Recovery of Online Redo Log: Thread 2 Group 7 Seq 236286 Reading mem 0
Mem# 0: +ARCDG/orcl/onlinelog/group_7.2049.840374587
Mem# 1: +ARCDG/orcl/onlinelog/group_7.2052.840374591
Mem# 2: +ARCDG/orcl/onlinelog/group_7.2057.840374593
Mem# 3: +ARCDG/orcl/onlinelog/group_7.2058.840374597
Recovery of Online Redo Log: Thread 2 Group 5 Seq 236287 Reading mem 0
Mem# 0: +ARCDG/orcl/onlinelog/group_5.2025.840374445
Mem# 1: +ARCDG/orcl/onlinelog/group_5.2026.840374483
Mem# 2: +ARCDG/orcl/onlinelog/group_5.2029.840374489
Mem# 3: +ARCDG/orcl/onlinelog/group_5.2034.840374493
Wed May 24 02:06:05 2017
Completed redo application of 1702.07MB
Completed crash recovery at
Thread 1: logseq 255942, block 291036, scn 12495715727030
Thread 2: logseq 236287, block 2742038, scn 12495715600520
230150 data blocks read, 229989 data blocks written, 2158203 redo k-bytes read
Thread 1 advanced to log sequence 255943 (thread recovery)
Picked broadcast on commit scheme to generate SCNs
Wed May 24 02:06:06 2017
Thread 2 advanced to log sequence 236288 (thread open)
Thread 2 opened at log sequence 236288
Current log# 6 seq# 236288 mem# 0: +ARCDG/orcl/onlinelog/group_6.2040.840374575
Current log# 6 seq# 236288 mem# 1: +ARCDG/orcl/onlinelog/group_6.2041.840374579
Current log# 6 seq# 236288 mem# 2: +ARCDG/orcl/onlinelog/group_6.2042.840374581
Current log# 6 seq# 236288 mem# 3: +ARCDG/orcl/onlinelog/group_6.2048.840374585
Successful open of redo thread 2
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Starting background process CTWR
Wed May 24 02:06:07 2017
CTWR started with pid=89, OS id=49349114
Block change tracking service is active.
Wed May 24 02:06:07 2017
SMON: enabling cache recovery
Successfully onlined Undo Tablespace 4.
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Wed May 24 02:06:11 2017
GTX0 started with pid=98, OS id=42992042
Starting background process RCBG
Wed May 24 02:06:11 2017
RCBG started with pid=52, OS id=40370648
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Wed May 24 02:06:13 2017
QMNC started with pid=119, OS id=47514086
Completed: ALTER DATABASE OPEN
SMON: Parallel transaction recovery tried
Starting background process SMCO
Wed May 24 02:06:17 2017
SMCO started with pid=153, OS id=50724962
Wed May 24 02:06:17 2017
db_recovery_file_dest_size of 2097152 MB is 4.35% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Wed May 24 02:06:19 2017
Starting background process CJQ0
Wed May 24 02:06:19 2017
CJQ0 started with pid=46, OS id=43581760
Wed May 24 02:07:35 2017
Thread 2 advanced to log sequence 236289 (LGWR switch)
Current log# 7 seq# 236289 mem# 0: +ARCDG/orcl/onlinelog/group_7.2049.840374587
Current log# 7 seq# 236289 mem# 1: +ARCDG/orcl/onlinelog/group_7.2052.840374591
Current log# 7 seq# 236289 mem# 2: +ARCDG/orcl/onlinelog/group_7.2057.840374593
Current log# 7 seq# 236289 mem# 3: +ARCDG/orcl/onlinelog/group_7.2058.840374597
Wed May 24 02:08:09 2017
Thread 2 advanced to log sequence 236290 (LGWR switch)
Current log# 5 seq# 236290 mem# 0: +ARCDG/orcl/onlinelog/group_5.2025.840374445
Current log# 5 seq# 236290 mem# 1: +ARCDG/orcl/onlinelog/group_5.2026.840374483
Current log# 5 seq# 236290 mem# 2: +ARCDG/orcl/onlinelog/group_5.2029.840374489
Current log# 5 seq# 236290 mem# 3: +ARCDG/orcl/onlinelog/group_5.2034.840374493
Wed May 24 02:08:51 2017
Thread 2 advanced to log sequence 236291 (LGWR switch)
Current log# 6 seq# 236291 mem# 0: +ARCDG/orcl/onlinelog/group_6.2040.840374575
Current log# 6 seq# 236291 mem# 1: +ARCDG/orcl/onlinelog/group_6.2041.840374579
Current log# 6 seq# 236291 mem# 2: +ARCDG/orcl/onlinelog/group_6.2042.840374581
Current log# 6 seq# 236291 mem# 3: +ARCDG/orcl/onlinelog/group_6.2048.840374585
Wed May 24 02:09:30 2017
Thread 2 advanced to log sequence 236292 (LGWR switch)
Current log# 7 seq# 236292 mem# 0: +ARCDG/orcl/onlinelog/group_7.2049.840374587
Current log# 7 seq# 236292 mem# 1: +ARCDG/orcl/onlinelog/group_7.2052.840374591
Current log# 7 seq# 236292 mem# 2: +ARCDG/orcl/onlinelog/group_7.2057.840374593
Current log# 7 seq# 236292 mem# 3: +ARCDG/orcl/onlinelog/group_7.2058.840374597

针对这个RAC集群,做了加多的配置修正、优化,主要是如下所述。

◆数据库配置优化

0、前期大量的日志切换无法完成,导致数据库挂起,为此新增了loggroup,并删掉无用多余的logmember。

1、增大了ASM实例的内存,由350M--->2G

2、RAC并行进程强制在本地实例分配

3、为减少对性能的影响,关闭audit审计

4、AWR 资料保存时间由7天增加到30天

5、SGA增大10G至60G

◆修改网络相关的系统核心参数

/usr/sbin/no -p -o tcp_sendspace=4194304

/usr/sbin/no -p -o tcp_recvspace=4194304

/usr/sbin/no -p -o rfc1323=1

/usr/sbin/no -p -o sb_max=8388608

/usr/sbin/no -p -o udp_ephemeral_low=9000

/usr/sbin/no -p -o tcp_ephemeral_low=9000

◆修改内联网卡的参数

chdev -l en1 -a tcp_sendspace=1048576 -a rfc1323=1 -a tcp_recvspace=1048576

◆SWAP

物理内存大于 16GB,但 SWAP 空间低于 16 GB,不符合 Oracle 安装的标准建议。增大PS。chps -s 192 hd6

◆时区及时间确认,安全起见,暂不修改。

◆ntp确认,使用GI集群自身的时间同步,确保各个节点的时间同步,目前是几十秒的偏差,不是问题。

◆HA集群配置修改,当初安装集群时部分内容没有按照官方文档实施,本次做了补救,但不全面,可能仍存在未知的风险。

5月25日重启后,经多一段时间的观察,有所改善,但于6月7日9点又发生一次节点2当。

但此次已经不是节点的驱逐,说明上次的修正与配置优化起到了作用,初步断定与网络参数配置修改相关。日志如下,无evict日志。

Wed Jun 07 09:02:57 2017
Errors in file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_7930030.trc  (incident=1023882):
ORA-00600: internal error code, arguments: [kjbrref:pkey], [3881577], [6], [7493495], [0], [], [], [], [], [], [], []
Incident details in: /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/incident/incdir_1023882/orcl2_lms0_7930030_i1023882.trc
Wed Jun 07 09:02:59 2017
Trace dumping is performing id=[cdmp_20170607090259]
Errors in file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_7930030.trc:
ORA-00600: internal error code, arguments: [kjbrref:pkey], [3881577], [6], [7493495], [0], [], [], [], [], [], [], []
LMS0 (ospid: 7930030): terminating the instance due to error 484
Instance terminated by LMS0, pid = 7930030
Wed Jun 07 09:14:01 2017
Starting ORACLE instance (normal)

针对此次的600内部错误,初步结论:

5月25日优化重启后,6月7日又发生过1次节点down的情况,经排查,基本确认是oracle BUG,在某些情况下会发生节点down。目前数据库PSU版本11.2.0.1.0,需要打补丁到最新版本11.2.0.1.6。后续需择机打补丁。

11G RAC 11.2.0.1.0实例evict故障处理