Race Condition (Software)
Race condition in software is an undesirable event that can happen when multiple entities access or modify shared resources in a system. The system behaves correctly when these entities use the shared resources as expected. But sometimes due to uncontrollable delays, the sequence of operations may change due to relative timing of events. When this happens, the system may enter a state not designed for and hence fail. The "race" happens because this type of failure is dependent on which entity gains access to the shared resource first.
One definition of race condition describes it as "anomalous behavior due to unexpected critical dependence on the relative timing of events." This definition is broad in the sense that it can apply to both hardware and software race conditions. In software, race condition happens because of some shared resource whose access is not properly controlled by design.
Can you give one example of race condition?
Suppose a process calls a function that's supposed to increment a value (v) stored in a database or some shared memory. This operation is not atomic, meaning that it can be broken down into smaller steps. The function will read the current value from database (A), store it in memory as a function variable (B), increment it (C) and finally write the new value to database (D). This function's execution is therefore a sequence of steps A-D.
Race condition will happen if process P1 calls this function and proceeds to step C. Meanwhile, it gets preempted by the operating system and another process P2 gets its chance to run. P2 calls the same function, completes all the steps A-D and returns. When P1 resumes, it continues from step C using an old value (v), not P2's result (v+1).
Race condition would not have happened if P1 had completed all steps without preemption; or P2 had been prevented from executing the function until P1 had completed all steps.
How is it possible that my one-line code is causing a race condition?
What appears as a single line in a high-level programming language, will actually translate to multiple assembly instructions that involve multiple clock cycles. The executing process or thread can therefore get interrupted anywhere along this sequence of assembly instructions. Thus, race condition is still possible.
What systems are prone to race conditions?
Because race condition is related to timing, it can happen in any system where multiple processes "race" to access a shared resource. Hence, it's not limited to just real-time systems. Any multithreaded or distributed system can have race conditions.
Hardware containing multicore CPUs, common in parallel computing, can suffer from race condition since multiple processes are running at the same time on multiple cores or CPUs. In networked systems, the channel can be considered as a shared resource and if two users attempt to access the shared channel race condition will lead to a packet collision. Network links that have large delays, such as satellite links, can cause race condition. Multithreaded applications are prone to race condition.
Web applications can suffer from race condition since multiple clients are sending requests in parallel to the web server, which may spawn multiple threads or processes to serve the requests. Even for a single user session, race condition is possible. For example, a web chat application may periodically send an AJAX request to check for the latest updates. At the same time, the user may create a new chat message.
Are there any best practices for avoiding race condition?
The usual solution to avoid race condition is to serialize access to the shared resource. If one process gains access first, the resource is "locked" so that other processes have to wait for the resource to become available. Even if the operating system allows other processes to execute, they will get blocked on the resource. This allows the first process to access and update the resource safely. Once done, it will "release" the resource. One of the processes waiting on the resource will now get its chance to access the resource. Code protected this way using locks is called critical section. The idea of granting access to only one process is called mutual exclusion.
Having large critical sections will affect performance. Such code may be refactored into smaller critical sections. In real-time systems or kernel code, another solution is to turn off interrupts when entering (and turn on when leaving) a critical section.
How to detect race condition in a program?
The first symptom is that results are unpredictable. In fact, the random nature of race condition poses a problem for debugging since under debug mode the race condition may not appear. For this reason, code reviews are recommended to catch race condition. This really implies that best way to avoid race condition is by careful design and not by testing and debugging.
There are tools that perform dynamic analysis and flag possible race conditions that may occur. Examples include Coverity's Thread Analyzer for Java and Intel Inspector. Clang Thread Safety Analysis is a static analysis tool to detect race condition. ThreadSanitizer is a data race detector written in C++ used by Clang and Go languages.
Do real-time operating systems (RTOS) have provisions to mitigate race conditions?
The common mechanism used to allow safe access to shared resource is called mutex, which is short for mutual exclusion. Any process entering a critical section will first acquire the mutex. Other processes wanting access to the critical section will get blocked. The owning process will give up the mutex when leaving the critical section.
What separate mutexes from semaphores is the concept of ownership. A mutex can be released only by its owner. This is not so with semaphores that can be used for signalling state change and synchronization from one thread to another. While it's possible to use semaphores to protect critical sections, they come with their own set of problems such as accidental release, recursive deadlock, task-death deadlock, and priority inversion. Mutexes are therefore a better solution.
Are there different types of race conditions?
- General races: Programs designed to be deterministic behaving in nondeterministic manner.
- Data races: Failure with accessing nonatomic critical sections in nondeterministic programs.
They also claim that debugging data races is easier since it's local to the critical section. General races require analysis of entire execution to understand exactly how the expected deterministic behaviour deviated.
A data race need not result in a race condition in the sense that program correctness is not compromised. A race condition that's not a data race implies a general race, for which the accompanying figure is an example.
Can you give examples of products that failed due to race conditions?
Therac-25 was a software-controlled radiation therapy machine used in the 1980s. Six patients were overdosed and some died. Race condition was one of many reasons for the system's failure. Shared variables were accessed by both data entry subroutine and magnet control subroutine. When a race condition occurred, the magnet control subroutine failed to read new settings by the operator.
In August 2003, Northeastern parts of North America suffered a massive blackout. "There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time. And that corruption led to the alarm event application getting into an infinite loop and spinning."
NASA's Mars Spirit Rover suffered a race condition whereby an initialization process could not obtain write access to a variable. An exception occurred. In another case, an imaging module was attempting to read from memory while a deactivation process was also triggered. Other problems, possibly involving race condition, have been reported.
Can race conditions be useful?
Race conditions are something software designers wish to avoid. But some researchers have attempted to show that they can be used to generate random numbers. This is dependent on the operating system's scheduler, which by itself is not random. Randomness is obtained by triggering context switches based on "execution environment, cache misses, the instruction execution pipeline and the imprecision of the hardware clock used to generate timer interrupts."
In describing THE Multiprogramming System developed for a Dutch machine called EL X8, Dijkstra explains how semaphores can be used to enable mutual exclusion on critical sections. In an example, he names a semaphore as "mutex". However, this is just a semaphore. It's quite different from the concept mutexes that take shape in the 1980s.
The concept of mutex is developed in the late 1980s. The exact history of mutexes is uncertain. They probably came about in the course of specifying the IEEE Std 1003.1, often known as POSIX. This standard is first released in 1988. At this date, it mentions semaphores but not mutexes. Mutexes probably got introduced into the standard in the 1990s.
- Clang. 2017a. "Thread Safety Analysis." Clang 5 Documentation. Accessed 2017-03-09.
- Clang. 2017b. "ThreadSanitizer." Clang 5 Documentation. Accessed 2017-03-12.
- Clark, Lin. 2017. "Avoiding race conditions in SharedArrayBuffers with Atomics." Mozilla Hacks, Mozilla, June 14. Accessed 2020-08-16.
- Colesa, A., R. Tudoran, and S. Banescu. 2008. "Software Random Number Generation Based on Race Conditions." 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Timisoara. pp. 439-444. Accessed 2017-03-08.
- Cooling, Niall. 2019a. "Mutex vs. Semaphore - Part 1." Embedded Related, April 12. Accessed 2020-08-16.
- Cooling, Niall. 2019b. "Mutex vs. Semaphores – Part 2: The Mutex & Mutual Exclusion Problems." Embedded Related, May 15. Accessed 2020-08-16.
- Daya. 2012. "Principles of Operating Systems." Accessed 2017-03-09.
- Dijkstra, E. W. 1968. "The Structure of the 'THE'-Multiprogramming System." Comm. of the ACM, vol. 11, no. 5, pp. 341-346, May. Accessed 2020-08-16.
- Golang. 2017. "Data Race Detector." The Go Programming Language. Accessed 2017-03-12.
- IEEE. 1988. "Portable operating system interface for computer environments." IEEE Std 1003.1-1998, IEEE Standard, Federal Information Processing Standards Publication, IEEE. Accessed 2020-08-16.
- Javarevisited. 2012. "What is Race Condition in Multithreading – 2 Examples in Java." Javarevisited, February. Accessed 2017-03-09.
- Jenkov, Jakob. 2016. "Race Conditions and Critical Sections." Jenkov.com, September 11. Accessed 2017-03-09.
- Joshi, Kavya. 2017. "Looking Inside a Race Detector." InfoQ, March 10. Accessed 2017-03-12.
- Leveson, Nancy. 1995. "Appendix A: Medical Devices: The Therac-25." Safeware: System Safety and Computers, Addison-Wesley. Accessed 2017-03-10.
- Micrium. 2020. "Mutual Exclusion Semaphores." Micrium Documentation. Accessed 2020-08-16.
- Microsoft. 2012. "Description of race conditions and deadlocks." Microsoft Support, June 18. Accessed 2017-03-09.
- Netzer, Robert H. B., and Barton P. Miller. 1992. "What are race conditions?: Some issues and formalizations." ACM Letters on Programming Languages and Systems, vol. 1, no. 1, pp. 74-88, March. Accessed 2020-08-16.
- Poulsen, Kevin. 2004. "Tracking the blackout bug." SecurityFocus, April 7. Accessed 2017-03-10.
- Regehr, John. 2011. "Race Condition vs. Data Race." Blog, Embedded in Academia, March 13. Accessed 2017-03-12.
- Shene, Ching-Kuang. 2020. "Race Conditions: Revisited." Concurrent Computing Project, Michigan Technological University. Accessed 2020-08-16.
- Somayaji, Anil. 2010. "COMP 3000 Essay 1 2010 Question 6." November 8. Accessed 2017-03-10.
- Vaughan, Jack. 2008. "Dynamic analysis tool from Coverity looks at concurrency defects." Tech Target, May 07. Accessed 2017-03-09.
- Wheeler, David A. 2015. "Avoid Race Conditions." Secure Programming HOWTO, September 19. Accessed 2017-03-08.
- Wikipedia. 2017. "Intel Inspector." Wikipedia, January 6. Accessed 2017-03-09.
- Wikipedia. 2020. "Race condition." Wikipedia, March 1. Accessed 2020-08-16.
- Wilson, Ron. 2004. "The trouble with Rover is revealed." EE Times, February 20. Accessed 2017-03-10.
- Race Condition (Hardware)
- Concurrent Computing
- Multiprocessing vs Multithreading
- Critical Section
- Mutual Exclusion