These days everything goes to the cloud or it has been collocated somewhere in a shared infrastructure. In this post I’ll talk about sessions being disconnected from your databases, firewalls and dead connection detection.
Changes
We moved number of 11g databases from one data centre to another.
Symptoms
Now probably many of you have seen the following error in your database alertlog «TNS-12535: TNS:operation timed out» or if you haven’t you will definitely see it some day.
Consider the following error from database alert log:
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 12-MAR-2015 10:28:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))
Thu Mar 12 10:28:09 2015
Now this error indicate timing issues between the server and the client. It’s important to mention that those errors are RESULTANT, they are informational and not the actual cause of the disconnect. Although this error might happen for number of reasons it is commonly associated with firewalls or slow networks.
Troubleshooting
The best way to understand what’s happening is to build a histogram of the duration of the sessions. In particular we want to understand whether disconnects are sporadic and random or they follow a specific pattern.
To do so you need to parse the listener log and locate the following line from the above example:
(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))
Since the port is random you might not get same record or if you do it might be days apart.
Here’s what I found in the listener:
12-MAR-2015 08:16:52 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) * establish * ORCL * 0
In other words — at 8:16 the user scott established connection from host 192.168.0.10.
Now if you compare both records you’ll get the duration of the session:
Established: 12-MAR-2015 08:16:52
Disconnected: Thu Mar 12 10:28:09 2015
Here are couple of other examples:
alertlog:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620))
Thu Mar 12 10:31:20 2015
listener.log:
12-MAR-2015 08:20:04 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) * establish * ORCL * 0
alertlog:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157))
Thu Mar 12 10:37:51 2015
listener.log:
12-MAR-2015 08:26:36 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) * establish * ORCL * 0
alertlog:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618))
Tue Mar 10 19:09:09 2015
listener.log
10-MAR-2015 16:57:54 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=root))(SERVICE_NAME=ORCL1)(SERVER=DEDICATED)) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) * establish * ORCL1 * 0
As you may have noticed the errors follow very strict pattern — each one gets disconnect exactly 2hrs 11mins after it has been established.
Cause
Given the repetitive behaviour of the issue and that it happened for multiple databases and application servers we can conclude that’s definitely a firewall issue.
The firewall recognizes the TCP protocol and keeps a record of established connections and it also recognizes TCP connection closure packets (TCP FIN type packet). However sometimes the client may abruptly end communication without closing the end points properly by sending FIN packet in which case the firewall will not know that the end-points will no longer use the opened channel. To resolve this problem firewall imposes a BLACKOUT on those connections that stay idle for a predefined amount of time.
The only issues with BLACKOUT is that neither or the sides will be notified.
In our case the firewall will disconnect IDLE sessions after around 2hrs of inactivity.
Solution
The solution for database server is to use Dead Connection Detection (DCD) feature. DCD detects when a connection has terminated unexpectedly and flags the dead session so PMON can release the resources associated with it.
DCD sets a timer when a session is initiated and when the timer expires SQL*Net on the server sends a small 10 bytes probe packet to the client to make sure connection is still active. If the client has terminated unexpectedly the server will get an error and the connection will be closed and the associated resources will be released. If the connection is still active then the probe packet is discarded and the timer is reset.
To enable DCD you need to set SQLNET.EXPIRE_TIME in sqlnet.ora of you RDBMS home!
cat >> $ORACLE_HOME/network/admin/sqlnet.ora
SQLNET.EXPIRE_TIME=10
This will set the timer to 10 minutes. Remember that sessions need to reconnect for the change to take place it won’t work for existing connections.
Firewalls become smarter and they can now inspect packages even deeper. Make sure the following settings are also disabled:
- SQLNet fixup protocol
- Deep Packet Inspection (DPI)
- SQLNet packet inspection
- SQL Fixup
I had similar issue with Dataguard already, read more here:
Smart Firewalls
How to test Dead Connection Detection
You might want to test or make sure DCD really works. You’ve got multiple options here — Oracle SQL client trace, Oracle SQL Server Trace, Sniff the network with packet analyzer OR using strace to trace the server process. I used strace since I had access to the database server and it was non intrusive.
-
Establish a connection to the database through SQL*Net
-
Find the processes number for your session:
SQL> select SPID from v$process where ADDR in (select PADDR from v$session where username='SVE');
SPID
------------------------
62761
- Trace the process:
[oracle@dbsrv ~]$ strace -tt -f -p 62761
Process 62761 attached - interrupt to quit
11:36:58.158348 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:36:58.158485 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
....
11:46:58.240065 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:46:58.240211 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
...
11:46:58.331063 write(20, "n620", 10) = 10
...
What I did was to attach to the process, simulate some activity at 11:36 and then leave the session IDLE. Then 10 minutes later the server process sent an empty packet to the client to check if the connection is still alive.
Conclusion
Errors in alertlog disappeared after I enabled the DCD.
Make sure to enable DCD if you host your databases in a shared infrastructure or there are firewalls between your database and application servers.
References
How to Check if Dead Connection Detection (DCD) is Enabled in 9i ,10g and 11g (Doc ID 395505.1)
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Resolving Problems with Connection Idle Timeout With Firewall (Doc ID 257650.1)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)
I’m seeing a lot of queries from a .Net data access server I maintain timing out. It appears to be totally random with no relationship to data or data locks. For example, the following query timed out!
SELECT NULL FROM DUAL
The system logs show that when it happened CPU was at 20%, memory 42%, Disk 3%. What is going on?
The DB is version 10.2.0.3.0 on HPUX.
The ODP driver is 2.111.6.20 (11g driver)
I checked out the sqlnet.log and found a large number of these error messages:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for HPUX: Version 10.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for HPUX: Version 10.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for HPUX: Version 10.2.0.3.0 - Production
Time: 29-JUN-2009 06:42:04
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
voretaq7
79.5k17 gold badges130 silver badges214 bronze badges
asked Jun 19, 2009 at 17:08
1
The particular error and the fact that it happens to all applications running against the database would strongly point to a network hiccup as the source of the problem.
-
How are TNS aliases resolved? Are
you using a local tnsnames.ora file? -
Assuming you are using a local
tnsnames.ora file, is the TNS alias
for the database using an IP address
or a host name? Using an IP address
eliminates the need to hit DNS, so it
may be worthwhile to try that in case
the problem is that your DNS server
is briefly going nuts. - You may also try configuring a backup listener and adding a failover option to the TNS alias. If the problem is that the network hiccups and loses the packets from the client to the listener communication randomly, having a failover option that can be tried may resolve the vast majority of issues without needing to figure out what piece of the network is going flaky. Of course, that assumes that the problem corrects itself quickly enough that the next connection attempt succeeds, but that may well be a reasonable assumption. If adding a backup listener resolves the problem, you can be all but certain that it’s a network issue.
answered Jun 29, 2009 at 18:41
Justin CaveJustin Cave
9887 silver badges11 bronze badges
I’ve found that turning off connection pooling is more reliable. I ran into several situations like you describe.
It turned out that the connections are tested BEFORE they are placed back into the pool, instead of as they’re being pulled for subsequent use.
On a good day, you’ll get an error of some sort. On a bad day, you’ll just hang waiting to use a connection that is completely messed up and will never work.
If you’re going to use pooling, I read recommendations that you execute an alter command against the session. (No, I don’t know of a good one. Oracle didn’t make it any clearer in the manual)
Brad
answered Jun 19, 2009 at 18:00
Brad BruceBrad Bruce
6168 silver badges17 bronze badges
2
I had this issue when the following conditions were true:
— JDBC Client was running on a machine with IP ZZ.ZZ.ZZ.ZZ
— Database Server had two NICs — one with IP XX.XX.XX.XX and another with YY.YY.YY.YY
— JDBC client URL was pointing to IP XX.XX.XX.XX, port 1521
— Using routing table, the client ZZ.ZZ.ZZ.ZZ was able to reach XX.XX.XX.XX
— The default «LISTENER» was listening on YY.YY.YY.YY port 1521, (hostname resolves to this IP)
— The LOCAL_LISTENER parametyer in SPFILE was NULL — that is it was never set
I resolved this by doing the following:
— Stopped the listener (lsnrctl stop)
— Changed LISTENER.ORA to listen on XX.XX.XX.XX (instead of the hostname default of YY.YY.YY.YY)
— Added an entry in TNSNAMES.ORA to set the local listener (LISTENER = (…)), essentially a copy of the entry used in LISTENER.ORA)
— Added a parameter LOCAL_LISTENER = LISTENER in spfile (ALTER SYSTEM SET … SCOPE=SPFILE)
— Restarted the LISTENER (lsnrctl start)
— Restarted the database
answered Jan 25, 2013 at 23:58
There’s not really enough information to come up with a solution, but here are a few things to try.
Are you able to reproduce this outside the app? Do you ever see failures with tnsping? Can you ping the server reliably?
Is the app running multiple connections in parallel? Are there limits to the number of connections?
What about the network — is there a firewall between the app and the db?
answered Jun 26, 2009 at 14:21
chrischris
3,9536 gold badges27 silver badges35 bronze badges
4
Although this is treating the symptom rather than whatever network stack disease you have, you could try increasing the
inbound_connect_timeout to see if the problem goes away.
To check what you’re running now, invoke LSNRCTL on the database host and issue the command:
show inbound_connect_timeout
You have to change this in your db host’s sqlnet.ora and listener.ora files:
sqlnet.ora:
SQLNET.INBOUND_CONNECT_TIMEOUT = 100 (assuming 100 is greater than your current timeout)
listener.ora:
INBOUND_CONNECT_TIMEOUT_yourLIstenerNameGoesHere = 100
answered Jun 30, 2009 at 19:04
Check also your firewall. Inbound port 1521 on the database server should be open (if you are connecting to the DB server from another server) .
answered Jun 12, 2015 at 13:24
I have just managed to resolve this issue, I have figured out that if you are running on DHCP server, make sure your host file has the current correct IP address, C:windowssystem32driversetchost
(open it using notepad) & this will fix your problem.
chicks
3,73410 gold badges27 silver badges36 bronze badges
answered Mar 10, 2017 at 16:34
Additional Tips: Please check your server/PC firewall settings. On some windows operating system such as Vista/Win7/Win Server; firewall is enable for SOHO network computers. Make sure to test all network groups private/office/public.
Then stop the firewall service and test for the connection.
answered Jun 2, 2017 at 19:33
YaredYared
111 bronze badge
My goal is to connect to an Oracle 9i instance from my OS X machine. I’ve followed the setup instructions here and got through them with no errors (eventually). However, I’m finding that sqlplus is unable to connect:
[ ethan@gir ~ ]$ sqlplus xxx/yyy@zzz
SQL*Plus: Release 10.2.0.4.0 - Production on Fri Apr 17 10:13:08 2009
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
Looooong wait…
ERROR:
ORA-12170: TNS:Connect timeout occurred
Enter user-name: xxx
Enter password:
ERROR:
ORA-12162: TNS:net service name is incorrectly specified
Enter user-name:
My tnsnames.ora
file…
zzz =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS =
(PROTOCOL = TCP)
(HOST = dbhost)
(PORT = 1521))
)
(CONNECT_DATA =
(SERVICE_NAME = zzz)
)
)
Maybe there’s an env variable that needs to be set?
UPDATE
Able to ping DB host machine no problem.
Tried…
sqlplus xxx/yyy@//dbhost/zzz
Got…
ERROR:
ORA-12170: TNS:Connect timeout occurred
Tried using SID
instead of SERVICE_NAME
in tnsnames.ora. Did not seem to change the result. Reverted back to SERVICE_NAME
.
Last couple entries in sqlnet.log…
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for MacOS X Server: Version 10.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for MacOS X Server: Version 10.2.0.4.0 - Production
Time: 17-APR-2009 10:33:06
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: Message 12535 not found; No message file for product=network, facility=TNS
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Message 505 not found; No message file for product=network, facility=TNS
nt secondary err code: 60
nt OS err code: 0
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for MacOS X Server: Version 10.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for MacOS X Server: Version 10.2.0.4.0 - Production
Time: 17-APR-2009 11:24:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: Message 12535 not found; No message file for product=network, facility=TNS
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Message 505 not found; No message file for product=network, facility=TNS
nt secondary err code: 60
nt OS err code: 0
PARTIAL ANSWER
Thanks everyone for your answers. They were helpful. I found that there was a DNS issue. I was able to ping by hostname, so thought that should work fine. I also tried I.P. address. Turned out that I needed the internal «10.1.x.x» I.P. address for it to work on this OS X machine (but hostname is fine on Windows).
At this point, I can connect with…
sqlplus xxx/yyy@//INTERNAL_IP/zzz
However, with those values entered into tnsnames.ora, this still doesn’t work…
sqlplus xxx/yyy@zzz
…
ORA-12154: TNS:could not resolve the connect identifier specified
I searched for a sample tnsnames.ora file that was close to what I needed and copied the contents into my file. Changed the params and now everything works. Not sure why mine wasn’t working.
November 3, 2009
TNS-12535 is message you can find in the sqlnet.log file of a database server, such as this:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Solaris: Version 10.2.0.1.0 – Production
Oracle Bequeath NT Protocol Adapter for Solaris: Version 10.2.0.1.0 – Production
TCP/IP NT Protocol Adapter for Solaris: Version 10.2.0.1.0 – Production
Time: 09-MAR-2007 18:44:34
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.221.70.83)(PORT=4155))
***********************************************************************
I have seen two flavors of TNS-12535. One flavor, like the one above, actually will get logged in the alert log as this:
WARNING: inbound connection timed out (ORA-3136)
Oracle documentation says this is related to a user making a request for a connection, but did not pass authentication information within the amount of time set.
The other flavor of TNS-12535 looks like this:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Solaris: Version 10.2.0.1.0 – Production
Oracle Bequeath NT Protocol Adapter for Solaris: Version 10.2.0.1.0 – Production
TCP/IP NT Protocol Adapter for Solaris: Version 10.2.0.1.0 – Production
Time: 13-MAR-2007 18:42:58
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 145
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.221.70.83)(PORT=4032))
***********************************************************************
Notice the “ns secondary err code” and “nt secondary err code” are different from the first flavor. This flavor does NOT get logged in the alert log, but is indicative of a client connection being dropped (as per this blog: http://hpdba.wordpress.com/tag/keep-alive/). This can be the result of firewall dropping connections per the firewall policy of idle sessions. The workaround for this can be done as follows:
“For Oracle databases, SQLNET.EXPIRE_TIME can be used. Although SQLNET.EXPIRE_TIME was intended as a dead connection detection (DCD) method, a side effect of sending probe packets between the database server and client/application server is that the connection remains active. Set the SQLNET.EXPIRE_TIME to less than the firewall timeout and the problem is solved.”