In this series, I will discuss AHA! moments I experienced while solving technical challenges. Each post in this series opens with this introduction, followed by the format of the post and then the details of the AHA! moment.
The format of this blog will be as follows:
AHA! Moment: This explains the lesson learnt. I will try to keep it very short, as a “dos and don’ts” type of statement.
Why: This section explains why the moment was an AHA! moment.
Example: I will give the example that led me to the AHA! moment.
If you see an “Unavailable NFS file” error (see screenshot below) in the online backup logs, you can either:
- safely ignore the error, or
- adjust the NFS_ACCESS_TIMEOUT parameter, or
- take the online backup when the system is less active, as recommended by Oracle (or any database vendor).
Error Message from the online backup log:
As you can see, the error message appears to imply that there was a serious problem accessing one of the Oracle data files. This not only led to unsuccessful backups but also raised a concern that database consistency might have been compromised. In addition, the data file is not on NFS, so we were not sure why the backup tool reported “Unavailable NFS file system”. I therefore analysed the logs and found that the system was very busy while the online backup was running, and that a database checkpoint was in progress when the backup tool reported the error.
What is a checkpoint, and why did an active checkpoint cause the “Unavailable NFS file” error?
As you probably know, when users change data, the database doesn’t write those changes to disk immediately. It first writes them to the data buffers in the SGA (System Global Area) and flushes them to disk at an appropriate time (I won’t go into the details of what triggers the flushing of these “dirty” buffers). This flushing activity is called a checkpoint. During a checkpoint, the database acquires an exclusive lock on the data files that have modified data in the SGA.
Before reading data, the backup tool attempted to acquire a lock on the data file. Because the checkpoint was in progress, the backup tool could not acquire the lock within its timeout, so it displayed a generic error message.
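To make the lock contention concrete, here is a minimal toy model in Python. It is only a sketch of the general idea (one party holds a lock, the other gives up after a timeout); it is not how Oracle or the backup tool actually implement locking, and the names `try_backup_read`, `hold_seconds` and `timeout_seconds` are made up for illustration.

```python
import threading
import time


def try_backup_read(hold_seconds: float, timeout_seconds: float) -> bool:
    """Return True if the simulated backup gets the file lock before its timeout."""
    file_lock = threading.Lock()      # stands in for the lock on one data file
    lock_held = threading.Event()     # signals that the checkpoint owns the lock

    def checkpoint() -> None:
        # The simulated checkpoint holds the lock while it "flushes" buffers.
        with file_lock:
            lock_held.set()
            time.sleep(hold_seconds)

    writer = threading.Thread(target=checkpoint)
    writer.start()
    lock_held.wait()                  # make sure the checkpoint got the lock first

    # The simulated backup waits for the lock, but only up to its timeout.
    acquired = file_lock.acquire(timeout=timeout_seconds)
    if acquired:
        file_lock.release()
    writer.join()
    return acquired


if __name__ == "__main__":
    print("short timeout:", try_backup_read(hold_seconds=0.3, timeout_seconds=0.05))
    print("long timeout: ", try_backup_read(hold_seconds=0.1, timeout_seconds=2.0))
```

With a timeout shorter than the time the lock is held, the simulated backup gives up; with a longer timeout it simply waits and then proceeds, which is the effect a larger timeout value is meant to have.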
Database: Oracle 10.2.0.4; O/S: Solaris 10.5; Backup tool: Veritas NetBackup
- The backup log reported (see the screenshot above) that it was ready to start backing up data file sr3.data26 at 4:26:38.
- At 4:26:49, 11 seconds later, the backup log reported the error.
- Database log analysis showed (see the screenshot below) that a checkpoint started at 4:26:31.
- The checkpoint didn’t complete until 4:26:52.
- We found that the default value for NFS_ACCESS_TIMEOUT is 5 seconds. I am not sure why the backup tool waited for 11 seconds rather than 5. We changed NFS_ACCESS_TIMEOUT to 30 seconds, and we have not seen this error since.
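The arithmetic behind this timeline can be checked with a few lines of Python. This is just a sketch using the four timestamps quoted above; the helper `seconds_between` and the variable names are my own, not anything from the backup tool or the database.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Whole seconds from start to end, both given as H:MM:SS on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps taken from the backup log and the database log above.
checkpoint_start = "4:26:31"
backup_ready = "4:26:38"
backup_error = "4:26:49"
checkpoint_end = "4:26:52"

backup_wait = seconds_between(backup_ready, backup_error)         # how long the backup waited
checkpoint_length = seconds_between(checkpoint_start, checkpoint_end)
needed_wait = seconds_between(backup_ready, checkpoint_end)       # wait needed to outlast the checkpoint

print(f"backup waited {backup_wait}s before reporting the error")
print(f"checkpoint ran for {checkpoint_length}s")
print(f"backup would have needed to wait {needed_wait}s")
for timeout in (5, 30):
    verdict = "enough" if timeout >= needed_wait else "not enough"
    print(f"timeout of {timeout}s: {verdict}")
```

In this incident the backup would have had to wait 14 seconds for the checkpoint to finish, so neither the 5-second default nor the observed 11-second wait was long enough, while 30 seconds leaves comfortable headroom.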