Posted: March 18th, 2023
Real Time and Fault Tolerant Systems
The Quest for Zero Downtime:
Since the dawn of the Internet, the need for application availability and reliability has continually increased. This need is especially strong in the military, aerospace, and aircraft control industries, where any amount of downtime can have fatal consequences. The 1998 case study “NCAPS: Application High Availability in UNIX Computer Clusters,” by Luiz A. Laranjeira, describes how Tandem Computers developed a specialized software system that runs on Unix computer clusters while providing a superior level of application availability. This essay offers a critique of the case study as well as of the software architecture and fault tolerance strategies used.
Application availability had improved immensely by the time of the case study, but there was still a need to improve the recovery times of existing high availability solutions, especially for real-time critical applications. Recovery was slow, expensive, and unreliable, taking anywhere between one minute and an hour. Therefore, the key design goal of the NCAPS system was to provide continuous availability of real-time critical systems in the event of hardware, software, or operating system faults. By significantly shrinking the recovery times of large-scale applications, the NCAPS design could not only keep these vital systems up and running, but also help reduce the hefty costs associated with downtime.
NCAPS provides specialized system software that runs on a Unix computer cluster with two or more nodes. According to some industry experts, this is the minimum requirement for a high availability cluster. Additionally, the system can provide more rapid failover because it is based on a primary/backup scheme, where two instances of an application are running at the same time.
As described in the case study, the NCAPS software architecture includes the Node Status Monitor (NSM), the Keepalive (KpA), the Process Pairs Manager (PPM), the Open Fault Tolerance Library (OftLib), and the Command Line Interface (CLI). The NSM, KpA, and PPM are replicated in both nodes and interact through continuous monitoring and message communication. The state of the two nodes is monitored by the NSM. The KpA keeps an eye on registered processes and uses a script to restart them in the case of failure.
More importantly, the PPM is the core of the NCAPS system: it starts, monitors, and manages application processes using a process pairs paradigm. In addition, the PPM state model can be configured by the user, which is a key competitive advantage over the offerings of other high availability software vendors.
Fault Tolerance Strategies Used
Redundancy has been defined as the duplication of critical components of a system with the intention of increasing reliability of the system, usually in the case of a backup or fail-safe (Answers.com, 2010). The two nodes of the NCAPS system offer redundancy because they mirror each other, always providing one node in primary status and the other in backup status. If one fails, the other is available to take over.
Redundancy in the NCAPS system can also be found in the NSM, where “heartbeats” are exchanged between the two NSMs. “When one NSM does not receive a configurable number of heartbeats from the other within a configurable period of time, it sends a node-down message to its subscribers (the PPM only). When the other node is restarted and the two NSMs resume exchanging heartbeats, the NSM sends a node-up message to its subscribers,” (Laranjeira, 1998 p. 442).
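The heartbeat rule quoted above can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not NCAPS code: the class name `NodeStatusMonitor`, the parameters `interval` and `max_missed`, and the callback-based subscriber list are all hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Callable, List


class NodeStatusMonitor:
    """Toy model of one NSM watching its peer (all names are hypothetical)."""

    def __init__(self, interval: float, max_missed: int) -> None:
        self.interval = interval        # expected seconds between heartbeats
        self.max_missed = max_missed    # configurable miss threshold
        self.last_heartbeat = 0.0       # time the last heartbeat arrived
        self.peer_up = True
        self.subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self.subscribers.append(callback)

    def _publish(self, message: str) -> None:
        for callback in self.subscribers:
            callback(message)

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat; announce node-up if the peer had been down."""
        self.last_heartbeat = now
        if not self.peer_up:
            self.peer_up = True
            self._publish("node-up")

    def check(self, now: float) -> None:
        """Announce node-down once enough heartbeats have been missed."""
        missed = int((now - self.last_heartbeat) // self.interval)
        if self.peer_up and missed >= self.max_missed:
            self.peer_up = False
            self._publish("node-down")


messages: List[str] = []
nsm = NodeStatusMonitor(interval=1.0, max_missed=3)
nsm.subscribe(messages.append)   # stands in for the PPM subscriber
nsm.on_heartbeat(now=1.0)
nsm.check(now=2.5)               # one interval missed: peer still up
nsm.check(now=4.0)               # three intervals missed: node-down
nsm.check(now=5.0)               # already down: no duplicate message
nsm.on_heartbeat(now=6.0)        # heartbeats resume: node-up
print(messages)
```

In this sketch a node-down message is published only once per outage, which matches the essay's reading that subscribers are notified on the transition rather than on every missed heartbeat.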
Always ready to switch from a backup to a primary state, the PPM provides redundancy as well: “One instance of the PPM and of the watched application run in each of two nodes of a cluster. In one node an instance of the application is in a primary state and is providing service. In the other node another instance of the application is in a backup state. A backup application is not providing service, but it is initialized and ready to take over in case of a failure of the primary application or of its node,” (Laranjeira, 1998 p. 442).
Described as a highly available service implemented as a primary and shadow instance, the Keepalive component offers further redundancy, “These two instances send heartbeats to each other and share information through a memory mapped file. If the shadow instance dies, the primary restarts it. If the primary instance dies, the shadow instance becomes primary, takes control of the memory mapped file, and spawns another shadow instance,” (Laranjeira, 1998 p. 442).
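The primary/shadow rule for Keepalive can be modeled in a few lines. The sketch below is only an illustration of the recovery rule as quoted: the class `KeepalivePair`, the instance names such as `kpa-1`, and the returned log strings are assumptions, and no real processes or memory mapped files are involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeepalivePair:
    """Toy model of the primary and shadow Keepalive instances (hypothetical)."""
    primary: str = "kpa-1"
    shadow: str = "kpa-2"
    spawned: int = 2  # counter used only to name new instances

    def _spawn(self) -> str:
        self.spawned += 1
        return f"kpa-{self.spawned}"

    def on_death(self, instance: str) -> List[str]:
        """Apply the recovery rule for a dead instance and describe the steps."""
        if instance == self.shadow:
            self.shadow = self._spawn()
            return [f"{self.primary} restarts shadow as {self.shadow}"]
        if instance == self.primary:
            self.primary = self.shadow    # the shadow is promoted
            self.shadow = self._spawn()   # and spawns a replacement shadow
            return [
                f"{self.primary} becomes primary",
                f"{self.primary} takes control of the memory mapped file",
                f"{self.primary} spawns shadow {self.shadow}",
            ]
        return []  # unknown instance: nothing to do


pair = KeepalivePair()
steps = pair.on_death("kpa-1")    # primary dies: the shadow takes over
steps += pair.on_death("kpa-3")   # the new shadow dies: the primary restarts it
print("\n".join(steps))
```

Either failure leaves the pair with one primary and one shadow again, which is the property that makes the service itself highly available.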
Information on fault/error isolation and containment in the NCAPS system was not clearly disclosed in the case study. Specifically regarding the PPM and application processes, however, the following was stated: “In failure situations, the PPM executes a cleanup script and restarts the application to a maximum configurable number of times. The cleanup script ensures that all application processes have exited before the application is restarted,” (Laranjeira, 1998 p. 443). In addition, as discussed below, the Hang Detection Service will “kill” an offending process if a hang is detected. Otherwise, relatively little information was provided on how faults or errors are contained.
In the case study, there is little to no definition of the types of faults the system detects, whether transient, permanent, or intermittent. Based on background course material, this may be a shortcoming: “In an ultra-reliable system, it is essential to have error detection and recovery mechanisms designed to handle transient faults. These mechanisms must be able to distinguish transient faults from permanent or intermittent faults, so that when a transient fault is detected in a unit the unit is not discarded,” (Course Objectives). That said, this does not mean that the NCAPS system is unable to distinguish between the different types of faults; it simply raises questions because this essential information was not provided in the case study.
Fault detection functionality in the NCAPS system can be found in the PPM, “When an application process fails, the PPM detects it and restarts it up to a maximum configurable number of times. After this threshold is exceeded, the next failure of that process will imply in a failure of the application,” (Laranjeira, 1998 p. 443). Also, the Application Administration (AAD), a key component of the PPM, provides fault detection by mediating the interactions between the Application State Model (ASM) and the application. The AAD detects an application event, such as a failure, and directs it into the ASM. After a state change takes place, an ASM action triggers the AAD to send a state change command message to the application processes (Laranjeira, 1998 p. 446).
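The restart threshold quoted above can be sketched as a small counter. This is a minimal illustration, assuming one counter per watched process; the class name `ProcessRestartPolicy` and the action strings are hypothetical and do not come from the case study.

```python
class ProcessRestartPolicy:
    """Counts restarts of one watched process (hypothetical helper)."""

    def __init__(self, max_restarts: int) -> None:
        self.max_restarts = max_restarts  # configurable threshold
        self.restarts = 0

    def on_failure(self) -> str:
        """Return the action to take for the failure just detected."""
        if self.restarts < self.max_restarts:
            self.restarts += 1
            return "restart-process"
        # The threshold is already used up: escalate to an application failure.
        return "application-failure"


policy = ProcessRestartPolicy(max_restarts=2)
actions = [policy.on_failure() for _ in range(3)]
print(actions)
```

With a threshold of two, the first two failures trigger a process restart and the third is treated as a failure of the whole application, which is the point at which failover to the backup would be considered.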
Part of the functionality provided by the Open Fault Tolerance Library (OftLib) is the Hang Detection Service (HDS), which offers the capability to detect faults that cause a process to “hang.” By using heartbeats with specified time intervals, the Hang Detection Service detects when a heartbeat is not received when expected and thus responds with the appropriate action. At this point, HDS simply ends the offending process. Keepalive then detects that the process no longer exists and the appropriate recovery mechanisms are triggered (Laranjeira, 1998 p. 444).
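A hang detector of this kind can be sketched with per-process deadlines. The code below is an assumption-laden illustration rather than the OftLib interface: `HangDetector`, `register`, `heartbeat`, and `sweep` are hypothetical names, and the sketch only reports hung processes instead of actually terminating them.

```python
from typing import Dict, List


class HangDetector:
    """Toy heartbeat-deadline tracker (hypothetical, not the OftLib API)."""

    def __init__(self) -> None:
        self._timeout: Dict[str, float] = {}   # allowed gap between heartbeats
        self._deadline: Dict[str, float] = {}  # time the next heartbeat is due

    def register(self, name: str, timeout: float, now: float) -> None:
        self._timeout[name] = timeout
        self._deadline[name] = now + timeout

    def heartbeat(self, name: str, now: float) -> None:
        """A heartbeat pushes the process's deadline forward."""
        self._deadline[name] = now + self._timeout[name]

    def sweep(self, now: float) -> List[str]:
        """Return processes whose heartbeat is overdue and stop tracking them."""
        hung = [name for name, due in self._deadline.items() if now > due]
        for name in hung:
            # A real service would end the process here; this sketch only
            # forgets it and leaves the restart to a Keepalive-like watcher.
            del self._deadline[name]
            del self._timeout[name]
        return hung


hds = HangDetector()
hds.register("ledger", timeout=2.0, now=0.0)
hds.register("quotes", timeout=2.0, now=0.0)
hds.heartbeat("ledger", now=1.5)   # ledger is alive; its deadline moves to 3.5
hung = hds.sweep(now=3.0)          # quotes missed its 2.0 deadline
print(hung)
```

Separating detection (the sweep) from recovery (the restart) mirrors the division of labor described above, where HDS ends the process and Keepalive notices its absence.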
Relatively little information was provided on system reconfiguration in the NCAPS system. Regarding the PPM, some functionality is configurable and allows users to define and execute their own scripts during a state change. During specific state changes, the user may therefore determine which actions should be applied; these may include transferring resources in the event of application failover or triggering an alarm on a specified state change (Laranjeira, 1998 p. 443).
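One way to picture user-defined actions on state changes is a transition table that maps a state and an event to a new state plus a list of callbacks. The sketch below assumes that shape; the class `ConfigurableStateModel`, the event names, and the logged action strings are hypothetical, and the real PPM runs user scripts rather than Python callables.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[], None]
# (current state, event) -> (next state, user-defined actions)
TransitionTable = Dict[Tuple[str, str], Tuple[str, List[Action]]]


class ConfigurableStateModel:
    """Toy state model driven by a user-supplied table (hypothetical)."""

    def __init__(self, initial: str, table: TransitionTable) -> None:
        self.state = initial
        self.table = table

    def handle(self, event: str) -> str:
        """Apply an event; run the user's actions if a transition matches."""
        transition = self.table.get((self.state, event))
        if transition is None:
            return self.state            # no entry: the state is unchanged
        self.state, actions = transition
        for action in actions:
            action()
        return self.state


log: List[str] = []
table: TransitionTable = {
    ("backup", "primary-failed"): (
        "primary",
        [lambda: log.append("transfer resources"),
         lambda: log.append("raise alarm")],
    ),
}
model = ConfigurableStateModel("backup", table)
model.handle("heartbeat-ok")     # no matching entry: stays in backup
model.handle("primary-failed")   # failover: runs both user actions
print(model.state, log)
```

Keeping the table as data is what makes the behavior configurable: a user changes the entries, not the engine that interprets them.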
System recovery within the NCAPS system is typically handled by the PPM or the Command Line Interface (CLI). With regard to the PPM, “In failure situations, the PPM executes a cleanup script and restarts the application up to a maximum configurable number of times. The cleanup script ensures that all application processes have exited before the application is restarted,” (Laranjeira, 1998 p. 443). In addition, the CLI gives system administrators the control they need to perform a range of operations, including the ability “to query the application’s state or to manually cause the application to failover, become primary, reinitialize, inhibit the failover function, (when the application is in the backup state), un-inhibit the failover function, startup or shutdown,” (Laranjeira, 1998 p. 444).
Overall, it appears that the NCAPS system is a highly effective solution built on a solid, logical architecture. From the primary/backup design to the PPM, NSM, and Keepalive components, redundancy is prevalent throughout the system and helps to provide high availability, resiliency, and security. Multiple components monitor the system, applications, and application processes, and allow communication between the various components. In addition, the PPM, AAD, and HDS can detect faults by monitoring system heartbeats, errors, and potential failures. System reconfiguration can be defined by the user, and system recovery can be handled by the PPM and the CLI. Because the system has many user-defined capabilities, users gain the flexibility to configure it to meet their specific needs.
While there was a lot of detailed information in the case study, there were some information gaps. A definition of the types of faults the system detects, such as transient, permanent, or intermittent, and how the system handled the different faults would have been helpful. Also, knowing how the faults and errors were then isolated and contained would have been useful.
Since the case study was written in 1998, it would be interesting to see where the product and its functionality stand today. The desire and need for highly available systems have only increased over time, and it appears that the NCAPS system would have had a strong lead over the competition.
Real Time and Fault Tolerant Systems
It’s no secret that the Internet has grown into an abundant, international resource that many people use, and rely on, daily. “Approximately 1.5 billion people worldwide use Internet today, and Internet usage continues to increase exponentially. A recent survey revealed that approximately 78% to 80% of the people in the age group of 18-50, use Internet,” (Arunnima, 2009).
From e-mail communications to online shopping, people can use the Internet to access the information or service they need whenever they want, from wherever they want, 24 hours a day, 7 days a week. This convenience and accessibility have led to an expectation that the services and information will be delivered no matter what. These expectations can be especially high for banking companies that offer online access to customer accounts and private information. Customers have come to expect that the services they need will not only be available, but reliable and secure. “Gone are those days where a customer would walk into a bank and wait for a representative to help do a fund transfer or to request for a demand draft. Expectations of customers have changed with the technological advancements in Internet and telecommunications. Today’s tech savvy customer would even want to deposit a cheque being at home at his/her convenience,” (Arunnima, 2009).
Because of its popularity, Internet banking was the Web service chosen for this essay. More and more, people are embracing the convenience of online banking: “About 75% of American banking customers surveyed during an October 2008 study reported using online banking to keep track of their expenses. Not surprisingly, a similar number confirmed that they were watching finances more closely during the current economic downturn. Online banking reported the strongest growth among all channels — customers wanted to watch their finances more closely, at least cost, and only banking served both ends,” (Jaymalya Palit, 2009).
Banking customers now expect access to their money at all times, whether to simply check their financial status or to pay bills and transfer funds. This requires a Web service that can ensure that services are available around the clock and that failures and errors won’t bring the system down. While it may not be considered a “life or death” situation if a customer can’t get into his/her account at a critical time, it could cause distress and/or affect a person’s credit by missing a payment by the due date or not being able to transfer needed funds.
The user experience: Web 2.0
For the hypothetical online banking service presented here, Web 2.0 will serve as the front-end software foundation. A proven and effective technology, Web 2.0 can support a customer-centric model, which is particularly helpful in the banking industry. “Technology can now enable banks to provide personalised interaction on party assisted or even unassisted channels. Powered by Web 2.0 technology, Internet banking is moving towards greater personalization and interactivity,” (Jaymalya Palit, 2009). These capabilities not only provide the appropriate next-generation technology, but also enable banks to establish better relationships with their customers: “Those banks that successfully deliver a memorable and unique customer experience, consistently across their offline and online channels, can hope to steal a march over their competitors,” (Jaymalya Palit, 2009).
For this design, Web 2.0 will be implemented as the user interface of the Web service. The banking interface will be customized according to user demands and feedback, with products and services displayed accordingly. The user access screen will be password-protected and will contain several sections that provide account information for the specific user, such as the different types of banking accounts, bill pay, transfers, banking statements, messages, and information on additional banking products. Furthermore, the information provided will have to be up-to-the-minute, allowing customers to see exactly what their financial status is at any given time. To fully enrich the customer’s experience, this may include integration with third-party services such as financial news, stocks, and weather forecasts, with information displayed in a multi-service window. The purpose of the multi-service window is to allow the user to open several service windows at one time without encountering an Internet “traffic jam,” thus improving the customer experience. “This kind of development model enables banks' IT engineers to pay more attention to individual service development, respond quickly to financial innovation demand from business staff, and improve the service constantly,” (Chen, Hong & Yu, 2009).
A “Channel Handler” on the server side supports communication with the browser through the XML or JSON data formats. The server application must also manage the components of the Web 2.0 graphic user interface, or GUI. In addition, the Web 2.0 framework is responsible for loading the required resources and managing the data models, as well as presenting and organizing the GUI (Chen, Hong & Yu, 2009).
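To make the server-side exchange concrete, the sketch below shows a handler that accepts a JSON request from the browser and returns a JSON response. It is an illustration only: the function name `handle_channel_request`, the `account_id` and `balance` fields, and the in-memory `accounts` dictionary are assumptions, not part of the framework described by Chen, Hong & Yu.

```python
import json
from typing import Dict


def handle_channel_request(raw: str, accounts: Dict[str, str]) -> str:
    """Parse a JSON request and return a JSON response (hypothetical format)."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError:
        return json.dumps({"status": "error", "reason": "malformed JSON"})

    # Only a JSON object with a string account id is accepted in this sketch.
    account_id = request.get("account_id") if isinstance(request, dict) else None
    if not isinstance(account_id, str) or account_id not in accounts:
        return json.dumps({"status": "error", "reason": "unknown account"})

    return json.dumps({
        "status": "ok",
        "account_id": account_id,
        "balance": accounts[account_id],  # kept as a string to avoid float rounding
    })


accounts = {"acct-1": "1250.00"}   # stand-in for the bank's data model
ok_reply = handle_channel_request('{"account_id": "acct-1"}', accounts)
bad_reply = handle_channel_request("not json", accounts)
print(ok_reply)
print(bad_reply)
```

Returning a structured error instead of raising keeps a malformed browser request from bringing the handler down, which fits the availability goals of the overall design.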
Fault tolerance and 24/7 availability
To complete the architecture of the online banking Web design, the NCAPS technology will be connected to the Web 2.0 framework. The NCAPS system will run on a Tandem S4000 cluster, which provides NonStop-UX, a fault-tolerant version of Unix. This version runs on a two-node cluster of S4000 machines connected by ServerNet (Laranjeira, 1998, p. 448; Horst, 1995).
The backup/primary scheme of the NCAPS system will enable high availability: the primary application provides service, while the backup application is idle and prepared to take over if a failure occurs. The multiple layers of redundancy throughout the NCAPS system also help provide the availability and reliability online banking customers expect: “The majority of communications applications, from cellular telephone conversations to credit card transactions, assume the availability of a reliable network. At this level, data are expected to traverse the network and to arrive intact at their destination,” (Medard & Lumetta, 2002).
The PPM will monitor the Web 2.0 functionality as well as applications and state changes, and will take action when processes fail. With the PPM components, the system is mostly self-managed; however, the system administrator can access the system and enact changes via the Command Line Interface (CLI) if necessary.
The OftLib, or Open Fault Tolerance Library, offers another layer of fault tolerance and availability for the online banking Web service by managing applications according to predefined policies, using checkpoints, detecting process hangs, and saving and restoring file descriptors. One example of how the OftLib functionality applies to online banking is when a user is in his/her personal bank account and the Web site times out, automatically logging the user out. This predefined script is based on timing and can be monitored and managed by the PPM. It could also serve a double purpose by protecting the system in the event of a process hang or other error while simultaneously providing identity protection for the user.
Furthermore, customers can easily encounter problems in the Internet banking environment because services are typically unmanned. The NCAPS system can help keep the service running by providing the appropriate backup mechanisms and by sending messages to the internal architecture. Essentially, the system will keep running despite the faults it encounters. By the time a system administrator is able to look at the system, the PPM will most likely have resolved the issue, or the system will be running in backup mode, and he/she can intervene with processes or diagnose faults as required.
The earlier, traditional methods of Internet banking offered a poor user experience, with many operations requiring a full-screen refresh. With the combined Web 2.0 technology and NCAPS system offered by the online banking services outlined here, users gain an integrated view of services and a superior level of availability. Full-screen refreshes are no longer necessary and if faults are detected, the NCAPS system can refresh the system in approximately 10 seconds. The NCAPS system also provides the stability needed to keep critical real-time applications up and running and the Web 2.0 technology will allow the bank to provide the “next evolution of the traditional Internet bank” (Chen, Hong & Yu, 2009) that customers are looking for. The Web 2.0 technology is a positive platform for an Internet banking service to build on and it has the flexibility to incorporate new business models and technologies as they arise.
Arunnima, B.S. 2009. Web 2.0 in Banking and Financial Services Industry. Infosys Technologies Limited. Available at http://www.infosys.com/finacle/solutions/thoughtpapers.asp [Online] [April 25, 2010].
Chen, X.M., Hong, S.J., and Yu, S. 2009. Next-generation banking with Web 2.0. IBM. Available at http://www.ibm.com/developerworks/web/library/wa-banking/#3.The Transformation and Challenges of Internet Banking|outline. [Online] [April 25, 2010].
Jaymalya Palit. 2009. Online Customer Experience: What Works. Available at http://www.infosys.com/finacle/solutions/thoughtpapers.asp [Online] [April 25, 2010].
Horst, R. 1995. “TNet: A Reliable System Area Network,” IEEE Micro, Vol. 15, No. 1, Feb,
Laranjeira, L.A. 1998. “NCAPS: Application High Availability in Unix Computer Clusters.”
Medard, M. and Lumetta, S. 2002. Network Reliability and Fault Tolerance. Available at http://www.google.com/search?sourceid=ie7&q=Network+Reliability+and+Fault+Tolerance&rls=com.microsoft:en-us:IE-SearchBox&ie=UTF-8&oe=UTF-8&rlz=1I7TSHB [Online] [April 22, 2010].
Answers.com. Definition of redundancy. Available at http://www.answers.com/topic/redundancy-engineering [Online] [April 23, 2010].
Course Objectives. Real-Time and Fault-Tolerant Systems.