首页 > 代码库 > 11G RAC 11.2.0.1.0实例evict故障处理
11G RAC 11.2.0.1.0实例evict故障处理
Aix 7.1
参考文档:
https://blogs.oracle.com/database4cn/rac
Resolving ORA-481 and "terminating the instance due to error 481" (Doc ID 1950963.1)
ORA-00481 After "The instance eviction reason is 0x2" due to Lack of Ticket (Doc ID 1644015.1)
归根结底是RAC节点之间消息传输流量控制的问题,除网络、硬件原因外,需要打补丁解决。
以下为第一次处理前具体的alertlog。
◆node1 Wed May 24 01:59:24 2017 Remote instance 2 kill is issued with system inc 228 LMON received an instance eviction notification from instance 1 The instance eviction reason is 0x2 The instance eviction map is 2 Reconfiguration started (old inc 228, new inc 230) List of instances: 1 (myinst: 1) Wed May 24 01:59:27 2017 Trace dumping is performing id=[cdmp_20170524015904] Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Wed May 24 01:59:28 2017 LMS 1: 14 GCS shadows cancelled, 1 closed, 0 Xw survived Wed May 24 01:59:28 2017 LMS 0: 23 GCS shadows cancelled, 0 closed, 0 Xw survived Wed May 24 01:59:28 2017 LMS 2: 21 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Wed May 24 01:59:32 2017 Instance recovery: looking for dead threads Beginning instance recovery of 1 threads Wed May 24 01:59:48 2017 parallel recovery started with 32 processes Started redo scan Wed May 24 01:59:49 2017 Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Wed May 24 02:00:00 2017 Completed redo scan read 2094379 KB redo, 237345 data blocks need recovery Wed May 24 02:00:03 2017 Reconfiguration started (old inc 230, new inc 232) List of instances: 1 2 (myinst: 1) Global Resource Directory frozen Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Wed May 24 02:00:04 2017 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Wed May 24 02:00:04 2017 Wed May 24 02:00:04 2017 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Wed May 24 02:02:27 2017 , ,此处大量12170 , Wed May 24 02:05:34 2017 LMON (ospid: 35979900): terminating the instance due to error 481 Wed May 24 02:05:34 2017 System state dump is made for local instance System State dumped to trace file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_40894656.trc Instance terminated by LMON, pid = 35979900 Wed May 24 08:12:27 2017 Starting ORACLE instance (normal) sskgpgetexecname failed to get name LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Interface type 1 en1 192.168.0.0 configured from GPnP Profile for use as a cluster interconnect Interface type 1 en0 10.209.199.0 configured from GPnP Profile for use as a public interface Picked latch-free SCN scheme 3 Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST Autotune of undo retention is turned on. LICENSE_MAX_USERS = 0 SYS auditing is disabled Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options. Using parameter settings in server-side pfile /u01/app/11.2.0/oracle/product/11.2.0/db_1/dbs/initorcl1.ora System parameters with non-default values: processes = 2000 timed_statistics = TRUE sga_max_size = 70G spfile = "+DATA1/orcl/spfileorcl.ora" sga_target = 50G control_files = "+DATA1/orcl/controlfile/current.261.832429113" control_files = "+ARCDG/orcl/controlfile/current.259.832429113" db_block_size = 8192 compatible = "11.2.0.0.0" log_archive_format = "%t_%s_%r.dbf" cluster_database = TRUE db_create_file_dest = "+DATA1" db_recovery_file_dest = "+ARCDG" db_recovery_file_dest_size= 2048G thread = 1 undo_tablespace = "UNDOTBS1" undo_retention = 10800 instance_number = 1 remote_login_passwordfile= "NONE" db_domain = "" service_names = "orcl" dispatchers = "(PROTOCOL=TCP) (SERVICE=orclXDB)" local_listener = "(ADDRESS = (PROTOCOL = TCP)(HOST = 10.209.199.4)(PORT = 1521))" remote_listener = "rac-scan:1521" result_cache_max_size = 268736K audit_file_dest = "/u01/app/11.2.0/oracle/admin/orcl/adump" audit_trail = "DB" db_name = "orcl" open_cursors = 1000 sql_trace = FALSE optimizer_index_caching = 90 pga_aggregate_target = 20G deferred_segment_creation= FALSE aq_tm_processes = 5 diagnostic_dest = "/u01/app/11.2.0/oracle" Deprecated system parameters with specified values: sql_trace End of deprecated system parameter listing Cluster communication is configured to use the following interface(s) for this instance 192.168.0.1 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 Wed May 24 08:12:32 2017 PMON started with pid=2, OS id=35652038 Wed May 24 08:12:32 2017 VKTM started with pid=3, OS id=33489120 at elevated priority VKTM running at (10)millisec precision with DBRM quantum (100)ms Wed May 24 08:12:32 2017 GEN0 started with pid=4, OS id=34209898 Wed May 24 08:12:32 2017 DIAG started with pid=5, OS id=36438538 Wed May 24 08:12:33 2017 DBRM started with pid=6, OS id=24576446 Wed May 24 08:12:33 2017 PING started with pid=7, OS id=29687920 Wed May 24 08:12:33 2017 PSP0 started with pid=8, OS id=36635296 Wed May 24 08:12:33 2017 ACMS started with pid=9, OS id=34341462 Wed May 24 08:12:33 2017 DIA0 started with pid=10, OS id=32047870 Wed May 24 08:12:33 2017 LMON started with pid=11, OS id=36045304 Wed May 24 08:12:35 2017 LMD0 started with pid=12, OS id=31391860 Wed May 24 08:12:35 2017 LMS0 started with pid=13, OS id=35979918 at elevated priority Wed May 24 08:12:35 2017 LMS1 started with pid=14, OS id=22741294 at elevated priority Wed May 24 08:12:36 2017 LMS2 started with pid=15, OS id=29229354 at elevated priority Wed May 24 08:12:36 2017 RMS0 started with pid=16, OS id=20185780 Wed May 24 08:12:36 2017 LMHB started with pid=17, OS id=29425998 Wed May 24 08:12:36 2017 MMAN started with pid=18, OS id=9699954 Wed May 24 08:12:36 2017 DBW0 started with pid=19, OS id=31850686 Wed May 24 08:12:36 2017 DBW1 started with pid=20, OS id=36045442 Wed May 24 08:12:36 2017 DBW2 started with pid=21, OS id=40894536 Wed May 24 08:12:36 2017 DBW3 started with pid=22, OS id=29819564 Wed May 24 08:12:36 2017 DBW4 started with pid=23, OS id=36634890 Wed May 24 08:12:36 2017 LGWR started with pid=24, OS id=35783370 Wed May 24 08:12:37 2017 CKPT started with pid=25, OS id=35914258 Wed May 24 08:12:37 2017 SMON started with pid=26, OS id=32374796 Wed May 24 08:12:37 2017 RECO started with pid=27, OS id=35652280 Wed May 24 08:12:37 2017 RBAL started with pid=28, OS id=36438322 Wed May 24 08:12:37 2017 ASMB started with pid=29, OS id=39387144 Wed May 24 08:12:37 2017 MMON started with pid=30, OS id=2621900 NOTE: initiating MARK startup Wed May 24 08:12:37 2017 Starting background process MARKMMNL started with pid=31, OS id=35979732 starting up 1 dispatcher(s) for network address ‘(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))‘... Wed May 24 08:12:37 2017 MARK started with pid=32, OS id=35717688 NOTE: MARK has subscribed starting up 1 shared server(s) ... lmon registered with NM - instance number 1 (internal mem no 0) Reconfiguration started (old inc 0, new inc 236) List of instances: 1 2 (myinst: 1) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid according to instance 2 * domain 0 valid = 1 according to instance 2 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Wed May 24 08:12:45 2017 Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Wed May 24 08:12:52 2017 LCK0 started with pid=34, OS id=24379778 Wed May 24 08:12:52 2017 Starting background process RSMN Wed May 24 08:12:52 2017 RSMN started with pid=36, OS id=30802026 ORACLE_BASE from environment = /u01/app/11.2.0/oracle Wed May 24 08:12:53 2017 ALTER DATABASE MOUNT Wed May 24 08:12:53 2017 NOTE: Loaded library: System Wed May 24 08:12:53 2017 SUCCESS: diskgroup DATA1 was mounted SUCCESS: diskgroup ARCDG was mounted Wed May 24 08:12:53 2017 NOTE: dependency between database orcl and diskgroup resource ora.DATA1.dg is established NOTE: dependency between database orcl and diskgroup resource ora.ARCDG.dg is established Wed May 24 08:12:57 2017 Successful mount of redo thread 1, with mount id 1472569957 Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE) Lost write protection disabled Completed: ALTER DATABASE MOUNT Wed May 24 08:12:58 2017 ALTER DATABASE OPEN Block change tracking file is current. Picked broadcast on commit scheme to generate SCNs Wed May 24 08:12:58 2017 SUCCESS: diskgroup DATA2 was mounted NOTE: dependency between database orcl and diskgroup resource ora.DATA2.dg is established SUCCESS: diskgroup DATA3 was mounted NOTE: dependency between database orcl and diskgroup resource ora.DATA3.dg is established Thread 1 advanced to log sequence 255944 (thread open) Thread 1 opened at log sequence 255944 Current log# 2 seq# 255944 mem# 0: +ARCDG/orcl/onlinelog/group_2.261.840376035 Current log# 2 seq# 255944 mem# 1: +ARCDG/orcl/onlinelog/group_2.2113.840376039 Current log# 2 seq# 255944 mem# 2: +ARCDG/orcl/onlinelog/group_2.2114.840376041 Current log# 2 seq# 255944 mem# 3: +ARCDG/orcl/onlinelog/group_2.2116.840376043 Successful open of redo thread 1 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Starting background process CTWR Wed May 24 08:12:59 2017 CTWR started with pid=39, OS id=32768642 Block change tracking service is active. Wed May 24 08:13:00 2017 SMON: enabling cache recovery Successfully onlined Undo Tablespace 2. Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active Starting background process GTX0 Wed May 24 08:13:07 2017 GTX0 started with pid=42, OS id=35062526 Starting background process RCBG Wed May 24 08:13:07 2017 RCBG started with pid=43, OS id=27721944 replication_dependency_tracking turned off (no async multimaster replication found) Wed May 24 08:13:08 2017 Starting background process QMNC Wed May 24 08:13:08 2017 QMNC started with pid=44, OS id=34013894 Completed: ALTER DATABASE OPEN Wed May 24 08:13:17 2017 Starting background process CJQ0 Wed May 24 08:13:17 2017 CJQ0 started with pid=41, OS id=11469004 Wed May 24 08:13:43 2017 Starting background process SMCO Wed May 24 08:13:43 2017 SMCO started with pid=120, OS id=42336480 , , , Wed May 24 09:08:16 2017 Errors in file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_j001_33161436.trc: ORA-12012: 鑎仟錨狮鎊骚琛^啼絕茕竈239 錨呛閊証 ORA-00001: 杩^蒎^蟐湾^辕涓^犁害鎊較哝^荨浠(CM.UK_LTE_ZY_CHECK_HISTORY) ORA-06512: 錨塄 "CM.JOB_LTE_ZY_CHECK", line 27 ORA-06512: 錨塄 line 1 ◆node2 Wed May 24 01:59:01 2017 LMS1 (ospid: 26411086) received an instance eviction notification from instance 1 [2] Wed May 24 01:59:01 2017 LMON received an instance eviction notification from instance 1 The instance eviction reason is 0x2 The instance eviction map is 2 Wed May 24 01:59:04 2017 PMON (ospid: 50725354): terminating the instance due to error 481 Wed May 24 01:59:04 2017 System state dump is made for local instance System State dumped to trace file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_42336686.trc Wed May 24 01:59:04 2017 ORA-1092 : opitsk aborting process Wed May 24 01:59:04 2017 License high water mark = 516 Instance terminated by PMON, pid = 50725354 USER (ospid: 50987430): terminating the instance Instance terminated by USER, pid = 50987430 Wed May 24 01:59:29 2017 Starting ORACLE instance (normal) sskgpgetexecname failed to get name LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Interface type 1 en1 192.168.0.0 configured from GPnP Profile for use as a cluster interconnect Interface type 1 en0 10.209.199.0 configured from GPnP Profile for use as a public interface Picked latch-free SCN scheme 3 Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST Autotune of undo retention is turned on. LICENSE_MAX_USERS = 0 SYS auditing is disabled Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options. Using parameter settings in server-side pfile /u01/app/11.2.0/oracle/product/11.2.0/db_1/dbs/initorcl2.ora System parameters with non-default values: processes = 2000 timed_statistics = TRUE sga_max_size = 70G spfile = "+DATA1/orcl/spfileorcl.ora" sga_target = 50G control_files = "+DATA1/orcl/controlfile/current.261.832429113" control_files = "+ARCDG/orcl/controlfile/current.259.832429113" db_block_size = 8192 compatible = "11.2.0.0.0" log_archive_format = "%t_%s_%r.dbf" cluster_database = TRUE db_create_file_dest = "+DATA1" db_recovery_file_dest = "+ARCDG" db_recovery_file_dest_size= 2048G thread = 2 undo_tablespace = "UNDOTBS2" undo_retention = 10800 instance_number = 2 remote_login_passwordfile= "NONE" db_domain = "" dispatchers = "(PROTOCOL=TCP) (SERVICE=orclXDB)" local_listener = "(ADDRESS = (PROTOCOL = TCP)(HOST = 10.209.199.5)(PORT = 1521))" remote_listener = "rac-scan:1521" result_cache_max_size = 268736K audit_file_dest = "/u01/app/11.2.0/oracle/admin/orcl/adump" audit_trail = "DB" db_name = "orcl" open_cursors = 1000 sql_trace = FALSE optimizer_index_caching = 90 pga_aggregate_target = 20G deferred_segment_creation= FALSE aq_tm_processes = 5 diagnostic_dest = "/u01/app/11.2.0/oracle" Deprecated system parameters with specified values: sql_trace End of deprecated system parameter listing Cluster communication is configured to use the following interface(s) for this instance 192.168.0.2 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 Wed May 24 01:59:34 2017 PMON started with pid=2, OS id=66584656 Wed May 24 01:59:34 2017 VKTM started with pid=3, OS id=66846744 at elevated priority VKTM running at (10)millisec precision with DBRM quantum (100)ms Wed May 24 01:59:34 2017 GEN0 started with pid=4, OS id=26608090 Wed May 24 01:59:34 2017 DIAG started with pid=5, OS id=26083716 Wed May 24 01:59:34 2017 DBRM started with pid=6, OS id=24510972 Wed May 24 01:59:34 2017 PING started with pid=7, OS id=65077306 Wed May 24 01:59:34 2017 PSP0 started with pid=8, OS id=66781402 Wed May 24 01:59:34 2017 ACMS started with pid=9, OS id=66978040 Wed May 24 01:59:34 2017 DIA0 started with pid=10, OS id=66519050 Wed May 24 01:59:34 2017 LMON started with pid=11, OS id=66453694 Wed May 24 01:59:37 2017 LMD0 started with pid=12, OS id=23658758 Wed May 24 01:59:37 2017 LMS0 started with pid=13, OS id=66322614 at elevated priority Wed May 24 01:59:37 2017 LMS1 started with pid=14, OS id=65798210 at elevated priority Wed May 24 01:59:37 2017 LMS2 started with pid=15, OS id=15204812 at elevated priority Wed May 24 01:59:38 2017 RMS0 started with pid=16, OS id=65732820 Wed May 24 01:59:38 2017 LMHB started with pid=17, OS id=65339438 Wed May 24 01:59:38 2017 MMAN started with pid=18, OS id=52298036 Wed May 24 01:59:38 2017 DBW0 started with pid=19, OS id=52232458 Wed May 24 01:59:38 2017 DBW1 started with pid=20, OS id=65273872 Wed May 24 01:59:38 2017 DBW2 started with pid=21, OS id=65208354 Wed May 24 01:59:38 2017 DBW3 started with pid=22, OS id=65994980 Wed May 24 01:59:38 2017 DBW4 started with pid=23, OS id=61210752 Wed May 24 01:59:38 2017 LGWR started with pid=24, OS id=6095298 Wed May 24 01:59:38 2017 CKPT started with pid=25, OS id=65470648 Wed May 24 01:59:38 2017 SMON started with pid=26, OS id=64946374 Wed May 24 01:59:38 2017 RECO started with pid=27, OS id=51642854 Wed May 24 01:59:39 2017 RBAL started with pid=28, OS id=64618728 Wed May 24 01:59:39 2017 ASMB started with pid=29, OS id=64553208 Wed May 24 01:59:39 2017 MMON started with pid=30, OS id=64422054 NOTE: initiating MARK startup Wed May 24 01:59:39 2017 MMNL started with pid=31, OS id=42336716 Starting background process MARK starting up 1 dispatcher(s) for network address ‘(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))‘... Wed May 24 01:59:39 2017 MARK started with pid=32, OS id=51577168 NOTE: MARK has subscribed Wed May 24 01:59:39 2017 starting up 1 shared server(s) ... lmon registered with NM - instance number 2 (internal mem no 1) Reconfiguration started (old inc 0, new inc 232) List of instances: 1 2 (myinst: 2) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid = 0 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Wed May 24 01:59:48 2017 Submitted all GCS remote-cache requests Wed May 24 02:05:10 2017 Trace dumping is performing id=[cdmp_20170524020534] Wed May 24 02:05:18 2017 Reconfiguration started (old inc 232, new inc 234) List of instances: 2 (myinst: 2) Nested reconfiguration detected. Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Wed May 24 02:05:19 2017 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Wed May 24 02:05:19 2017 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Wed May 24 02:05:19 2017 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete Wed May 24 02:05:25 2017 LCK0 started with pid=34, OS id=51118396 Wed May 24 02:05:25 2017 Starting background process RSMN Wed May 24 02:05:25 2017 RSMN started with pid=36, OS id=51773786 ORACLE_BASE not set in environment. It is recommended that ORACLE_BASE be set in the environment Reusing ORACLE_BASE from an earlier startup = /u01/app/11.2.0/oracle Wed May 24 02:05:25 2017 ALTER DATABASE MOUNT This instance was first to mount Wed May 24 02:05:26 2017 NOTE: Loaded library: System Wed May 24 02:05:26 2017 SUCCESS: diskgroup DATA1 was mounted Wed May 24 02:05:26 2017 NOTE: dependency between database orcl and diskgroup resource ora.DATA1.dg is established SUCCESS: diskgroup ARCDG was mounted NOTE: dependency between database orcl and diskgroup resource ora.ARCDG.dg is established Wed May 24 02:05:30 2017 Successful mount of redo thread 2, with mount id 1472569957 Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE) Lost write protection disabled Completed: ALTER DATABASE MOUNT ALTER DATABASE OPEN This instance was first to open Wed May 24 02:05:30 2017 SUCCESS: diskgroup DATA2 was mounted NOTE: dependency between database orcl and diskgroup resource ora.DATA2.dg is established SUCCESS: diskgroup DATA3 was mounted NOTE: dependency between database orcl and diskgroup resource ora.DATA3.dg is established Block change tracking file is current. Beginning crash recovery of 2 threads parallel recovery started with 32 processes Started redo scan Wed May 24 02:05:44 2017 Completed redo scan read 2158203 KB redo, 230150 data blocks need recovery Started redo application at Thread 1: logseq 255942, block 163388 Thread 2: logseq 236286, block 2067920 Recovery of Online Redo Log: Thread 1 Group 3 Seq 255942 Reading mem 0 Mem# 0: +ARCDG/orcl/onlinelog/group_3.2085.840375427 Mem# 1: +ARCDG/orcl/onlinelog/group_3.2084.840375445 Mem# 2: +ARCDG/orcl/onlinelog/group_3.263.840375447 Mem# 3: +ARCDG/orcl/onlinelog/group_3.2089.840375449 Recovery of Online Redo Log: Thread 2 Group 7 Seq 236286 Reading mem 0 Mem# 0: +ARCDG/orcl/onlinelog/group_7.2049.840374587 Mem# 1: +ARCDG/orcl/onlinelog/group_7.2052.840374591 Mem# 2: +ARCDG/orcl/onlinelog/group_7.2057.840374593 Mem# 3: +ARCDG/orcl/onlinelog/group_7.2058.840374597 Recovery of Online Redo Log: Thread 2 Group 5 Seq 236287 Reading mem 0 Mem# 0: +ARCDG/orcl/onlinelog/group_5.2025.840374445 Mem# 1: +ARCDG/orcl/onlinelog/group_5.2026.840374483 Mem# 2: +ARCDG/orcl/onlinelog/group_5.2029.840374489 Mem# 3: +ARCDG/orcl/onlinelog/group_5.2034.840374493 Wed May 24 02:06:05 2017 Completed redo application of 1702.07MB Completed crash recovery at Thread 1: logseq 255942, block 291036, scn 12495715727030 Thread 2: logseq 236287, block 2742038, scn 12495715600520 230150 data blocks read, 229989 data blocks written, 2158203 redo k-bytes read Thread 1 advanced to log sequence 255943 (thread recovery) Picked broadcast on commit scheme to generate SCNs Wed May 24 02:06:06 2017 Thread 2 advanced to log sequence 236288 (thread open) Thread 2 opened at log sequence 236288 Current log# 6 seq# 236288 mem# 0: +ARCDG/orcl/onlinelog/group_6.2040.840374575 Current log# 6 seq# 236288 mem# 1: +ARCDG/orcl/onlinelog/group_6.2041.840374579 Current log# 6 seq# 236288 mem# 2: +ARCDG/orcl/onlinelog/group_6.2042.840374581 Current log# 6 seq# 236288 mem# 3: +ARCDG/orcl/onlinelog/group_6.2048.840374585 Successful open of redo thread 2 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Starting background process CTWR Wed May 24 02:06:07 2017 CTWR started with pid=89, OS id=49349114 Block change tracking service is active. Wed May 24 02:06:07 2017 SMON: enabling cache recovery Successfully onlined Undo Tablespace 4. Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active Starting background process GTX0 Wed May 24 02:06:11 2017 GTX0 started with pid=98, OS id=42992042 Starting background process RCBG Wed May 24 02:06:11 2017 RCBG started with pid=52, OS id=40370648 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Wed May 24 02:06:13 2017 QMNC started with pid=119, OS id=47514086 Completed: ALTER DATABASE OPEN SMON: Parallel transaction recovery tried Starting background process SMCO Wed May 24 02:06:17 2017 SMCO started with pid=153, OS id=50724962 Wed May 24 02:06:17 2017 db_recovery_file_dest_size of 2097152 MB is 4.35% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup. Wed May 24 02:06:19 2017 Starting background process CJQ0 Wed May 24 02:06:19 2017 CJQ0 started with pid=46, OS id=43581760 Wed May 24 02:07:35 2017 Thread 2 advanced to log sequence 236289 (LGWR switch) Current log# 7 seq# 236289 mem# 0: +ARCDG/orcl/onlinelog/group_7.2049.840374587 Current log# 7 seq# 236289 mem# 1: +ARCDG/orcl/onlinelog/group_7.2052.840374591 Current log# 7 seq# 236289 mem# 2: +ARCDG/orcl/onlinelog/group_7.2057.840374593 Current log# 7 seq# 236289 mem# 3: +ARCDG/orcl/onlinelog/group_7.2058.840374597 Wed May 24 02:08:09 2017 Thread 2 advanced to log sequence 236290 (LGWR switch) Current log# 5 seq# 236290 mem# 0: +ARCDG/orcl/onlinelog/group_5.2025.840374445 Current log# 5 seq# 236290 mem# 1: +ARCDG/orcl/onlinelog/group_5.2026.840374483 Current log# 5 seq# 236290 mem# 2: +ARCDG/orcl/onlinelog/group_5.2029.840374489 Current log# 5 seq# 236290 mem# 3: +ARCDG/orcl/onlinelog/group_5.2034.840374493 Wed May 24 02:08:51 2017 Thread 2 advanced to log sequence 236291 (LGWR switch) Current log# 6 seq# 236291 mem# 0: +ARCDG/orcl/onlinelog/group_6.2040.840374575 Current log# 6 seq# 236291 mem# 1: +ARCDG/orcl/onlinelog/group_6.2041.840374579 Current log# 6 seq# 236291 mem# 2: +ARCDG/orcl/onlinelog/group_6.2042.840374581 Current log# 6 seq# 236291 mem# 3: +ARCDG/orcl/onlinelog/group_6.2048.840374585 Wed May 24 02:09:30 2017 Thread 2 advanced to log sequence 236292 (LGWR switch) Current log# 7 seq# 236292 mem# 0: +ARCDG/orcl/onlinelog/group_7.2049.840374587 Current log# 7 seq# 236292 mem# 1: +ARCDG/orcl/onlinelog/group_7.2052.840374591 Current log# 7 seq# 236292 mem# 2: +ARCDG/orcl/onlinelog/group_7.2057.840374593 Current log# 7 seq# 236292 mem# 3: +ARCDG/orcl/onlinelog/group_7.2058.840374597
针对这个RAC集群,做了加多的配置修正、优化,主要是如下所述。
◆数据库配置优化
0、前期大量的日志切换无法完成,导致数据库挂起,为此新增了loggroup,并删掉无用多余的logmember。
1、增大了ASM实例的内存,由350M--->2G
2、RAC并行进程强制在本地实例分配
3、为减少对性能的影响,关闭audit审计
4、AWR 资料保存时间由7天增加到30天
5、SGA增大10G至60G
◆修改网络相关的系统核心参数
/usr/sbin/no -p -o tcp_sendspace=4194304
/usr/sbin/no -p -o tcp_recvspace=4194304
/usr/sbin/no -p -o rfc1323=1
/usr/sbin/no -p -o sb_max=8388608
/usr/sbin/no -p -o udp_ephemeral_low=9000
/usr/sbin/no -p -o tcp_ephemeral_low=9000
◆修改内联网卡的参数
chdev -l en1 -a tcp_sendspace=1048576 -a rfc1323=1 -a tcp_recvspace=1048576
◆SWAP
物理内存大于 16GB,但 SWAP 空间低于 16 GB,不符合 Oracle 安装的标准建议。增大PS。chps -s 192 hd6
◆时区及时间确认,安全起见,暂不修改。
◆ntp确认,使用GI集群自身的时间同步,确保各个节点的时间同步,目前是几十秒的偏差,不是问题。
◆HA集群配置修改,当初安装集群时部分内容没有按照官方文档实施,本次做了补救,但不全面,可能仍存在未知的风险。
5月25日重启后,经多一段时间的观察,有所改善,但于6月7日9点又发生一次节点2当。
但此次已经不是节点的驱逐,说明上次的修正与配置优化起到了作用,初步断定与网络参数配置修改相关。日志如下,无evict日志。
Wed Jun 07 09:02:57 2017 Errors in file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_7930030.trc (incident=1023882): ORA-00600: internal error code, arguments: [kjbrref:pkey], [3881577], [6], [7493495], [0], [], [], [], [], [], [], [] Incident details in: /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/incident/incdir_1023882/orcl2_lms0_7930030_i1023882.trc Wed Jun 07 09:02:59 2017 Trace dumping is performing id=[cdmp_20170607090259] Errors in file /u01/app/11.2.0/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lms0_7930030.trc: ORA-00600: internal error code, arguments: [kjbrref:pkey], [3881577], [6], [7493495], [0], [], [], [], [], [], [], [] LMS0 (ospid: 7930030): terminating the instance due to error 484 Instance terminated by LMS0, pid = 7930030 Wed Jun 07 09:14:01 2017 Starting ORACLE instance (normal)
针对此次的600内部错误,初步结论:
5月25日优化重启后,6月7日又发生过1次节点down的情况,经排查,基本确认是oracle BUG,在某些情况下会发生节点down。目前数据库PSU版本11.2.0.1.0,需要打补丁到最新版本11.2.0.1.6。后续需择机打补丁。
11G RAC 11.2.0.1.0实例evict故障处理