Load-link/store-conditional

In computer science, load-linked/store-conditional^[1] (LL/SC), sometimes known as load-reserved/store-conditional^[2] (LR/SC), are a pair of instructions used in multithreading to achieve synchronization. Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. Together, this implements a lock-free, atomic, read-modify-write operation.

"Load-linked" is also known as load-link,^[3] load-reserved,^[2] and load-locked.^{[citation needed]}

LL/SC was originally^[4] proposed by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor^[1]^{[failed verification]} at Lawrence Livermore National Laboratory.

Comparison of LL/SC and compare-and-swap

If any updates have occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored. As such, an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which will not detect updates if the old value has been restored (see ABA problem).

Real implementations of LL/SC do not always succeed even if there are no concurrent updates to the memory location in question. Any exceptional events between the two operations, such as a context switch, another load-link, or even (on many platforms) another load or store operation, will cause the store-conditional to spuriously fail. Older implementations will fail if there are any updates broadcast over the memory bus. This is called weak LL/SC by researchers, as it breaks many theoretical LL/SC algorithms.^[5] Weakness is relative, and some weak implementations can be used for some algorithms.

LL/SC is more difficult to emulate than CAS. Additionally, stopping running code between paired LL/SC instructions, such as when single-stepping through code, can prevent forward progress, making debugging tricky.^[6]

Nevertheless, LL/SC is equivalent to CAS in the sense that either primitive can be implemented in terms of the other, in O(1) and in a wait-free manner.^[7]

Implementations

LL/SC instructions are supported by:

Alpha: ldl_l^[8]/stl_c^[9] and ldq_l^[8]/stq_c^[9]
PowerPC/Power ISA: lwarx/stwcx and ldarx/stdcx^[10]^[11]
MIPS: ll/sc^[12] and lld/scd^[13]
ARM: ldrex/strex (ARMv6,^[14] v7 and v8-M^[15]), and ldxr/stxr (ARMv8-A^[16])
RISC-V: lr/sc^[2]
ARC: LLOCK/SCOND

Some CPUs^[which?] require the address being accessed exclusively to be configured in write-through mode.

Typically, CPUs track the load-linked address at a cache-line or other granularity, such that any modification to any portion of the cache line (whether via another core's store-conditional or merely by an ordinary store) is sufficient to cause the store-conditional to fail.

All of these platforms provide weak^{[clarification needed]} LL/SC. The PowerPC implementation allows an LL/SC pair to wrap loads and even stores to other cache lines (although this approach is vulnerable to false cache line sharing). This allows it to implement, for example, lock-free reference counting in the face of changing object graphs with arbitrary counter reuse (which otherwise requires double compare-and-swap, DCAS). RISC-V provides an architectural guarantee of eventual progress for LL/SC sequences of limited length.

Some ARM implementations define platform dependent blocks, ranging from 8 bytes to 2048 bytes, and an LL/SC attempt in any given block fails if there is between the LL and SC a normal memory access inside the same block. Other ARM implementations fail if there is a modification anywhere in the whole address space. The former implementation is the stronger and most practical.

LL/SC has two advantages over CAS when designing a load–store architecture: reads and writes are separate instructions, as required by the design philosophy (and pipeline architecture); and both instructions can be performed using only two registers (address and value), fitting naturally into common 2-operand ISAs. CAS, on the other hand, requires three registers (address, old value, new value) and a dependency between the value read and the value written. x86, being a CISC architecture, does not have this constraint; though modern chips may well translate a CAS instruction into separate LL/SC micro-operations internally.

Extensions

Hardware LL/SC implementations typically do not allow nesting of LL/SC pairs.^[17] A nesting LL/SC mechanism can be used to provide a MCAS primitive (multi-word CAS, where the words can be scattered).^[18] In 2013, Trevor Brown, Faith Ellen, and Eric Ruppert implemented in software a multi-address LL/SC extension (which they call LLX/SCX) that relies on automated code generation;^[19] they have used it to implement one of the best-performing concurrent binary search tree (actually a chromatic tree), slightly beating the JDK CAS-based skip list implementation.^[20]

References

^ ^a ^b "S-1 project". Stanford Computer Science wiki. 2018-11-30.
^ ^a ^b ^c Andrew Waterman; Krste Asanović, eds. (2017-05-07). "7.2 Load-Reserved/Store-Conditional Instructions". The RISC-V Instruction Set Manual, Volume 1: User-Level ISA, Version 2.2 (PDF).
^ US20030217115A1, Rowlands, Joseph, "Load-linked/store conditional mechanism in a CC-NUMA system", issued 2003-11-20
^ Herlihy, Maurice (1993-11-01). "A methodology for implementing highly concurrent data objects". ACM Transactions on Programming Languages and Systems. 15 (5): 745–770. doi:10.1145/161468.161469. ISSN 0164-0925.
^ Beckmann, Nathan. "Synchronization" (PDF). 15-740: Computer Architecture, Fall 2018. Carnegie Mellon University. Retrieved 23 April 2021.
^ Keno Fischer (2020-05-02). "Julia 1.5 Feature Preview: Time Traveling (Linux) Bug Reporting". Retrieved 2020-05-14.
^ James H. Anderson; Mark Moir (1995). "Universal constructions for multi-object operations". PODC '95 Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing. ACM. pp. 184–193. doi:10.1145/224964.224985. ISBN 0-89791-710-3. S2CID 8204331. See their Table 1, Figures 1 & 2 and Section 2 in particular.
^ ^a ^b "Alpha Architecture Reference Manual" (PDF). pp. 4–9~4–12. Retrieved 2024-01-26.
^ ^a ^b "Alpha Architecture Reference Manual" (PDF). pp. 4–13~4–15. Retrieved 2024-01-26.
^ May, Cathy; Silha, Ed; Simpson, Eick; Warren, Hank (1993). The PowerPC architecture: A SPECIFICATION FOR A NEW FAMILY OF RISC PROCESSORS. Morgan Kaufmann PUblishers, Inc. pp. 336–338, 465. ISBN 1-55860-316-6.
^ Kacmarcik, Cary (1995). Optimizing PowerPC Code. Addison-Wesley Publishing Company. pp. 71–72. ISBN 0-201-40839-2.
^ "APPLICATION NOTE MIPS R4000 Synchronization Primitives" (PDF). p. 9. Retrieved 2023-12-27.
^ "APPLICATION NOTE MIPS R4000 Synchronization Primitives" (PDF). p. 5. Retrieved 2023-12-27.
^ "ARM11 MPCore™ Processor Revision: r2p0 Technical Reference Manual". p. 301-302(8-7,8-8). Retrieved 2023-12-14.
^ "Arm®v8-M Architecture Reference Manual". p. 278. Retrieved 2023-12-14.
^ "ARMv8-A Synchronization primitives". p. 6. Retrieved 2023-12-14.
^ James R. Larus; Ravi Rajwar (2007). Transactional Memory. Morgan & Claypool. p. 55. ISBN 978-1-59829-124-7.
^ Keir Fraser (February 2004). Practical lock-freedom (PDF) (Technical report). University of Cambridge Computer Laboratory. p. 20. UCAM-CL-TR-579.
^ Brown, Trevor; Ellen, Faith; Ruppert, Eric (2013). "Pragmatic primitives for non-blocking data structures" (PDF). PODC '13 Proceedings of the 2013 ACM symposium on Principles of distributed computing. ACM. pp. 13–22. arXiv:1712.06688. doi:10.1145/2484239.2484273. ISBN 978-1-4503-2065-8. S2CID 6537417. See also slides
^ Trevor Brown; Faith Ellen; Eric Ruppert (2014). "A general technique for non-blocking trees" (PDF). PPoPP '14 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM. pp. 329–342. arXiv:1712.06687. doi:10.1145/2555243.2555267. ISBN 978-1-4503-2656-8. S2CID 9442380.

Jensen, Eric H.; Hagensen, Gary W.; Broughton, Jeffrey M. (November 1987). A New Approach to Exclusive Data Access in Shared Memory Multiprocessors (PDF) (Technical report). Lawrence Livermore National Laboratory. UCRL-97663. Archived from the original (PDF) on 2017-02-02. Retrieved 2012-02-22.
Bruner, John D.; Hagensen, Gary W.; Jensen, Eric H.; Pattin, Jay C.; Broughton, Jeffrey M. (11 November 1987). Cache Coherence on the S-1 AAP (PDF) (Technical report). Lawrence Livermore National Laboratory. UCRL-97646. Archived from the original (PDF) on 2 February 2017. Retrieved 10 November 2013.
Detlefs, D.; Martin, P.; Moir, M.; Steele, Jr., Guy L. (2001). "Lock-free reference counting". PODC '01 Proceedings of the twentieth annual ACM symposium on Principles of distributed computing. ACM. pp. 190–9. CiteSeerX 10.1.1.92.8221. doi:10.1145/383962.384016. ISBN 1-58113-383-9.
Reinholtz, Kirk (December 2004). "Atomic Reference Counting Pointers". C/C++ Users Journal.
Sites, R. L. (February 1993). "Alpha AXP architecture". Comm. ACM. 36 (2): 33–44. doi:10.1145/151220.151226. S2CID 5473184.

[S-1-1] "S-1 project". Stanford Computer Science wiki. 2018-11-30.

[RISCV-2] Andrew Waterman; Krste Asanović, eds. (2017-05-07). "7.2 Load-Reserved/Store-Conditional Instructions". The RISC-V Instruction Set Manual, Volume 1: User-Level ISA, Version 2.2 (PDF).

[3] US20030217115A1, Rowlands, Joseph, "Load-linked/store conditional mechanism in a CC-NUMA system", issued 2003-11-20

[4] Herlihy, Maurice (1993-11-01). "A methodology for implementing highly concurrent data objects". ACM Transactions on Programming Languages and Systems. 15 (5): 745–770. doi:10.1145/161468.161469. ISSN 0164-0925.

[5] Beckmann, Nathan. "Synchronization" (PDF). 15-740: Computer Architecture, Fall 2018. Carnegie Mellon University. Retrieved 23 April 2021.

[Fischer2020-6] Keno Fischer (2020-05-02). "Julia 1.5 Feature Preview: Time Traveling (Linux) Bug Reporting". Retrieved 2020-05-14.

[Anderson1995-7] James H. Anderson; Mark Moir (1995). "Universal constructions for multi-object operations". PODC '95 Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing. ACM. pp. 184–193. doi:10.1145/224964.224985. ISBN 0-89791-710-3. S2CID 8204331. See their Table 1, Figures 1 & 2 and Section 2 in particular.

[alpha-ll-8] "Alpha Architecture Reference Manual" (PDF). pp. 4–9~4–12. Retrieved 2024-01-26.

[alpha-sc-9] "Alpha Architecture Reference Manual" (PDF). pp. 4–13~4–15. Retrieved 2024-01-26.

[10] May, Cathy; Silha, Ed; Simpson, Eick; Warren, Hank (1993). The PowerPC architecture: A SPECIFICATION FOR A NEW FAMILY OF RISC PROCESSORS. Morgan Kaufmann PUblishers, Inc. pp. 336–338, 465. ISBN 1-55860-316-6.

[11] Kacmarcik, Cary (1995). Optimizing PowerPC Code. Addison-Wesley Publishing Company. pp. 71–72. ISBN 0-201-40839-2.

[12] "APPLICATION NOTE MIPS R4000 Synchronization Primitives" (PDF). p. 9. Retrieved 2023-12-27.

[13] "APPLICATION NOTE MIPS R4000 Synchronization Primitives" (PDF). p. 5. Retrieved 2023-12-27.

[14] "ARM11 MPCore™ Processor Revision: r2p0 Technical Reference Manual". p. 301-302(8-7,8-8). Retrieved 2023-12-14.

[15] "Arm®v8-M Architecture Reference Manual". p. 278. Retrieved 2023-12-14.

[16] "ARMv8-A Synchronization primitives". p. 6. Retrieved 2023-12-14.

[LarusRajwar2007-17] James R. Larus; Ravi Rajwar (2007). Transactional Memory. Morgan & Claypool. p. 55. ISBN 978-1-59829-124-7.

[Fraser2004-18] Keir Fraser (February 2004). Practical lock-freedom (PDF) (Technical report). University of Cambridge Computer Laboratory. p. 20. UCAM-CL-TR-579.

[Brown2013-19] Brown, Trevor; Ellen, Faith; Ruppert, Eric (2013). "Pragmatic primitives for non-blocking data structures" (PDF). PODC '13 Proceedings of the 2013 ACM symposium on Principles of distributed computing. ACM. pp. 13–22. arXiv:1712.06688. doi:10.1145/2484239.2484273. ISBN 978-1-4503-2065-8. S2CID 6537417. See also slides

[Brown2014-20] Trevor Brown; Faith Ellen; Eric Ruppert (2014). "A general technique for non-blocking trees" (PDF). PPoPP '14 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM. pp. 329–342. arXiv:1712.06687. doi:10.1145/2555243.2555267. ISBN 978-1-4503-2656-8. S2CID 9442380.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

Comparison of LL/SC and compare-and-swap

Implementations

Extensions

See also

References