TM Program Concurrency Analysis: Parallelism versus Number of Schedules
Miroslav Popovic, Miodrag Djukic and Nenad Cetic
Transactional Memory [1-3] is a promising concurrency control mechanism that simplifies concurrent programming by providing atomic, compositional blocks of program code. Until recently, TM was predominantly a subject of extensive R&D activities, but since newest multicores rolled out by IBM  and Intel  include TM hardware, it seems that TM might be finally accepted more widely. Still, one of the key issues that TM skeptics always bring into the discussion is TM program performance. Researchers mostly interested in TM implementations typically use algorithm complexity as a theoretical measure and throughput as a practical performance measure. For example, Diegues and Cachopo  use this approach in their recent technical report. Besides this more conventional approach, some of the researchers are proposing other performance measures. For example, Gramoli  proposed that the set of schedules (interleavings of steps of the sequential program), should be used as a measure of the “amount of concurrency”. In contrast to Gramoli , we in our previous work  propose an approach to analysis of TM programs by using a well-established work-span methodology, which is based on modeling programs as DAGs, and calculating their work, span, parallelism, and speedup. In our adaptation of the work-span methodology, work of a given group of transactions is simply the sum of individual execution times of all the transactions, whereas the span (i.e. the critical path) is practically the worst case parallel execution time of the complete group of transactions on an ideal parallel computer.
In this work, we compare the particular values of parallelism calculated by our method with the number of interleavings for some special cases of the Simple Bank program that was introduces in our previous work. Besides the fact that we already analyzed these cases in our previous work, thus it was the easiest starting point for the comparison, there is no other special reason why we selected these cases. The conclusions of this work are the following: (1) the parallelism and the number of interleavings are completely different measures, (2) the number of interleavings is an absolute measure that does not provide any information on parallelism that exists in a concurrent program, nor the speedup that a concurrent program may achieve, and (3) the work, span, parallelism, and speedup are measures that are more meaningful then the number of interleavings. Therefore, we recommend using adapted work-span methodology when analyzing TM program concurrency and performance.
 M. Herlihy, and J. E. B. Moss, “Transactional memory: architectural support for lockfree data structures”, in Proc. of the 20th Annual International Symposium on Computer Architecture, New York, 1993, pp. 289–300.
 R. Guerraoui and M. Kapalka, Principles of Transactional Memory. Morgan and Claypool, 2010.
 T. Harris, J.R. Larus, and R. Rajwar, Transactional Memory, 2nd edition. Morgan and Claypool, 2010.
 R.A. Haring, M. Ohmacht; T.W. Fox, M.K. Gschwind, D.L. Satterfield, K. Sugavanam, P.W. Coteus, P. Heidelberger, M.A. Blumrich, R.W. Wisniewski, A. Gara, G.L.-T. Chiu, P.A. Boyle, N.H. Chist, Changhoan Kim, "The IBM Blue Gene/Q Compute Chip", IEEE Micro, 32(2), 2012, pp. 48-60.
 J. Reinders, "Transactional Synchronization in Haswell", July 2012. [Online]. Available: http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell.
 N. Diegues and J. Cachopo, “Exploring parallelism in transactional workloads”, INESC-ID Lisbon, Tech. Rep. RT/16/2012, June 2012.
 V. Gramoli, P. Kuznetsov, and S. Ravi, “Sharing a Sequential Data Structure: Correctness Definition and Correctness Analysis”, The Fourth Workshop on the Theory of Transactional Memory, 2012. [Online]. Available: http://lpd.epfl.ch/gramoli/wttm4.
 M. Popovic, I. Basicevic, M. Djukic, and N. Cetic. “Estimating Parallelism of Transactional Memory Programs”. In Proc. of the 3rd IEEE International Conference on Information Science and Technology, Yangzhou, Jiangsu, China, 2013.
Enhancing Permissiveness in Transactional Memory via Time-Warping
Nuno Diegues and Paolo Romano
The notion of permissiveness  in Transactional Memory (TM) translates to only aborting a transaction when it cannot be accepted in any history that guarantees a target correctness criterion. Unfortunately, this desirable property is often neglected by state of the art TMs [2,3,4,5]. In this context, existing literature on TM has highlighted an inherent trade-off between the efficiency achievable by an implementation of a TM algorithm, and the number of spurious aborts it introduces . Indeed, in order to maximize implementation's efficiency, most TMs resort to aborting transactions under overly conservative conditions and, hence, incur in the risk of rejecting a significant number of serializable histories.
In this talk we will describe Time-Warp Multi-version (TWM), a novel multi-version concurrency control algorithm that aims to strike an unexplored balance between permissiveness and efficiency. TWM is based on the key idea of allowing a write transaction that has performed stale reads (i.e. missed the writes of concurrently committed transactions) to be safely serialized by "committing it in the past", which we call a time-warp commit. At its core TWM uses a novel, lightweight validation mechanism that can be implemented with minimum bookkeeping and computational overheads, thus ensuring efficiency. The key idea is to exclusively track some anti-dependencies developed by a committing transaction, hence avoiding onerous validation of the entire conflicts' graph (unlike algorithms that ensure permissiveness ). Another noteworthy feature of TWM is that it preserves what is probably the most important property of MVCC algorithms: the ability to ensure that read-only transactions can never be aborted (a property known as mv-permissiveness ).
We provide an extensive analysis of the correctness properties of TWM, demonstrating that it guarantees Virtual World Consistency , a safety property (strictly stronger than serializability) that is deemed as particularly relevant in the context of TM. We have also implemented a highly optimized, lock-free version of the TWM algorithm and compared it with two state of the art TMs using three standard benchmarks. Our evaluation study demonstrates the practicality of the TWM algorithm by highlighting consistent performance improvements across all considered workloads: TWM achieves 98% average speed up, with gains extending up to 7x in high concurrency scenarios. These results are a consequence of the drastic reduction of aborts achieved by TWM with respect to less permissive algorithms.
 Rachid Guerraoui, Thomas A. Henzinger, and Vasu Singh. Permissiveness in transactional memories. In Proceedings of the 22nd international symposium on Distributed Computing, DISC ’08, pages 305–319, Berlin, Heidelberg, 2008. Springer-Verlag.
 Pascal Felber, Christof Fetzer, and Torvald Riegel. Dynamic performance tuning of word-based software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP ’08, pages 237–246, New York, NY, USA, 2008. ACM.
 Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking ii. In Proceedings of the 20th international conference on Distributed Computing, DISC’06, pages 194–208, Berlin, Heidelberg, 2006. Springer-Verlag.
 Aleksandar Dragojevic, Rachid Guerraoui, and Michal Kapalka. Stretching transactional memory. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 155–165, New York, NY, USA, 2009. ACM.
 Sérgio Miguel Fernandes and João Cachopo. Lock-free and scalable multi-version software transactional memory. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming,
PPoPP ’11, pages 179–188, New York, NY, USA, 2011. ACM.
 Hany E. Ramadan, Indrajit Roy, Maurice Herlihy, and Emmett Witchel. Committing conflicting transactions in an stm. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’09, pages 163–172, New York, NY, USA, 2009. ACM.
 Dmitri Perelman, Rui Fan, and Idit Keidar. On maintaining multiple versions in stm. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing, PODC ’10, pages 16–25, New York, NY, USA, 2010. ACM.
 Damien Imbs and Michel Raynal. Virtual world consistency: A condition for stm systems (with a versatile protocol with invisible read operations). Theoretical Computer Science, 444(0):113–127, 2012.
Tackling Weak-Atomicity in Multiversion Algorithms
Ricardo Dias, Tiago Vale and João Lourenço
Traditionally, multiversion algorithms for transactional memory maintain a version list for each managed memory location or object field, and all the read and write operations manipulate the data values in that list. This implies that all accesses to fields in shared objects must be redirected to the version list, and hence must be enclosed in a memory transaction. Therefore, multiversion algorithms require a strong atomicity model, which may be implemented by warping each non-transactional access in a small transaction, incurring in a non-negligible performance penalty.
When using a weak atomicity model, updates made by non-transactional code to object fields are not seen by transactional code and, on the other way around, updates made by transactional code are not seen by non-transactional code. For instance, if an application programmer initializes a shared data-structure without wrapping the initialization procedure in a memory transaction, further transactional accesses, to such data-structure, will not see the initialization and the program may crash.
In a previous work  we presented an extension of DeuceSTM that enables the efficient support of multiversion algorithms. DeuceSTM does not provide a strong atomicity model, and allows non-transactional accesses to fields of objects that were also accessed inside memory transactions. This hinders the usage of multiversion algorithms in DeuceSTM. In  we addressed this problem by rewriting the existing benchmarks to wrap all accesses to shared objects inside an atomic method. Such code changes, when feasible, are always a cumbersome and error prone process.
In this work we present an adaptation of traditional multiversion algorithms to support a weak atomicity execution model. The key idea for our solution is to store the value of the latest version in the object’s field instead of in the node at the head of the version list. The commit algorithm must also be enhanced to guarantee the correctness of this approach, as it introduces an additional step of updating the object's field data and the head node of the version list, which must also be performed atomically. Our approach to support the weak atomicity model incurs in a negligible performance overhead, allowing efficient transactional and non-transactional interaction using multiversion algorithms.
 Dias RJ, Vale TM, Lourenc ̧o JM. Efficient support for in-place metadata in transactional memory. Euro-Par 2012 Parallel Processing, Lecture Notes in Computer Science, vol. 7484, Kaklamanis C, Papatheodorou T, Spirakis P (eds.). Springer Berlin / Heidelberg; 589–600.
The hardships of TLSTM, the first Transactional Memory with Thread Level Speculation algorithm
Ricardo Filipe and Joao Barreto
Transactional Memory with Thread Level Speculation (TM+TLS) has been proposed as a new programming paradigm that achieves higher levels of parallelism than ever before, by dividing TM transactions into TLS tasks that will execute in parallel, each on its own hardware thread.
The TLSTM implementation of the TM+TLS model still has some problems that it has to overcome, so that TM+TLS achieves its final goal of globally replacing TM as the programming paradigm of choice.
The TLSTM algorithm is very naïve on operations that require synchronization of a transaction's tasks. TLSTM also lacks good usage of idle task time. These are the main problems of TLSTM:
Transaction validation is executed separately by each task
Transaction abort requires that all of its tasks synchronize to abort the transaction atomically.
If a task of some transaction completes its execution early, it starts busy waiting until all of the transaction's tasks complete
The first problem occurs because in TLSTM the last task of the transaction needs to check all of that transaction's tasks timestamps in order to see if they are all valid at the same point of execution. If they are not, the whole transaction needs to be validated. If an invalid read is detected, all of the transaction's tasks need to abort and restart execution.
The second problem translates into tasks having to detect that they are supposed to abort because of a transaction write conflict and wait for all of its transaction's tasks to synchronize at that point. We need this synchronization because several tasks of a transaction may write to the same address in memory, which may lead to deadlocks if tasks are let to abort separately.
The final problem we have identified thus far occurs because a task is put on busy wait once it has completed its execution, checking if it should abort because of a conflict on the transaction it belongs to. If we decide that a task can advance to other code after finishing its current execution, there are problems that may come up when aborting transactions in TLSTM .
As we have explained, these are not easy problems to tackle, they are serious hardships. Each of these problems requires a solution and opens up research paths in the TM+TLS world, which other researchers should see as opportunities.
Abort Free SemanticTM by Dependency Aware Scheduling of Transactional Instructions
Shlomi Dolev, Panagiota Fatourou, and Eleftherios Kosmas
We present a TM system that executes transactions without ever causing any aborts. The system uses a set of t-var lists, one for each transactional variable. A scheduler undertakes the task of placing the instructions of each transaction in the appropriate t-var lists based on which t-variable each of them accesses. A set of worker threads are responsible to execute these instructions. Because of the way instructions are inserted in and removed from the lists, by the way the worker threads work, and by the fact that the scheduler places all the instructions of a transaction in the appropriate tvar lists before doing so for the instructions of any subsequent transaction, it follows that no conflict will ever occur. Parallelism
is fine-grained since it is achieved at the level of transactional instructions instead of transactions themselves (i.e., the instructions of a transaction may be executed concurrently).
Session 2: Hardware Transactional Memory and Scheduling
Towards Transactional Memory for Safety-Critical Embedded Systems
Stefan Metzlaff, Sebastian Weis and Theo Ungerer
In this talk, we argue that Transactional Memory (TM), which mainly has been of concern in general purpose systems, can also be employed in safety-critical embedded real-time systems with all its advantages. Therefore, TM has to suffice hard real-time (HRT) constraints and dependability requirements. We propose a TM for safety-critical systems that bounds the worst-case execution time (WCET) of parallel applications and ensures their fault-tolerant execution.
To provide a predictable execution of parallel applications using TM, the timing of transactions has to be predictable. The main challenge is to bound the interferences a transaction may suffer, i.e. the maximum number of transaction retries required to solve all its potential conflicts. Therefore, we propose a TM contention manager, that guarantees the transactions to be committed in the order of their arrival, similar to [1,2]. To also support mixed criticality applications, in which transactions of high and low criticality share data, the TM system has to ensure that any transaction of low criticality influences high critical transactions only in a defined and limited manner. Otherwise, no valid timing bounds for the high critical transactions are calculable. Therefore, we extended the contention manager to perform a prioritised commit, which defers all low critical transactions while high critical transactions are active. So, for high critical transactions HRT guarantees are provided. Whereas, low critical transactions may use TM in a best effort manner, but no guarantees can be given.
Preliminary results showed that our approach limits the interference of low critical transactions on high critical transactions to minimum.
Further, we plan to enhance our approach to also ensure fault tolerance [3,4]. Vulnerable code will be encapsulated by fault-tolerance transactions that are executed redundantly. On commit of fault-tolerance transactions the TM system compares their output, i.e. transactional write set and register file. In case of error the faulty transactions will be recovered. Depending on the criticality of the application, we also intend to adjust the level of redundancy, e.g. by allowing high critical tasks to issue more than two redundant transactions in parallel and then using forward recovery in case of error, while low critical tasks use only time redundancy and roll-back recovery.
Our approach joins predictable execution and fault tolerance by exploiting TM to build a system that bounds the delays of transactional conflicts and provides fault containment, detection, and recovery. This is the first step towards a predictable multi-core system for parallel fault-tolerant HRT applications.
 Sherif F. Fahmy, Binoy Ravindran, and E. D. Jensen. On bounding response times under software transactional memory in distributed multiprocessor real-time systems.In Proceedings of the Conference on Design, Automation and Test in Europe, DATE’09, pages 688–693. European Design and Automation Association, 2009.
 Martin Schoeberl, Florian Brandner, and Jan Vitek. RTTM: real-time transactional memory. In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC’10, pages 326–333. ACM, 2010.
 D. Sanchez, J.L. Aragon, and J.M. Garcia. A log-based redundant architecture for reliable parallel computation. In Proceedings of the International Conference on High Performance Computing, HiPC ’10, pages 1–10. IEEE Computer Society, 2010.
 Gulay Yalcin, Osman Unsal, Ibrahim Hur, Adrian Cristal, and Mateo Valero. FaulTM: Fault-Tolerance Using Hardware Transactional Memory. In Proceedings of the Workshop on Parallel Execution of Sequential Programs on Multi-core Architecture, Pespma ’10.
Deterministic execution of TM applications
Vesna Smiljkovic, Christof Fetzer, Osman Unsal, Adrian Cristal and Mateo Valero
With the advance of multicore systems, threads can be executed in parallel and therefore interleave nondeterministically. With nondeterminism, executions of threads can produce different outputs even when they start with the same input. Sources of nondeterminism are various such as: other processes running in parallel, the state of memory, caches or any microarchitecture structure.
Nondeterministic programs are hard to test and debug. On the other hand, determinism provides repeatable execution, i.e. for the same input a program gives the same output. In the presence of a bug determinism helps a programmer to replicate, find and fix the bug. Importantly for fault-tolerant and highly available systems, any mismatch in replicas’ outputs is only due to faulty or attacked execution one of the replicas when the replicas execute deterministically.
In this work we propose deterministic execution of TM applications. We take advantage of functionalities of a TM compiler and a TM library: a TM compiler instruments memory accesses and distinguishes shared data accesses from thread-local data accesses, and a TM library provides synchronization of the shared data accesses. Our modifications remain within a state-of-the-art TM library and do not require modifications of operating systems, system libraries nor benchmarks. We propose several different implementations, starting with the simplest solution of serializing transactions and providing more complex solutions to exploit more parallelism. They all use the round-robin scheduling algorithm. As a preliminary work, we implemented a deterministic TM system based on the global lock, evaluated it and verified its correctness.
Further more, we intent to provide strong determinism by deterministically executing code within and outside of transactions. We also have to support synchronization primitives (like lock acquisitions and synchronization barriers). Finally, if we provide different scheduling algorithms, we can cover many (ideally all) cases of threads interleaving, which can be used for exhausting benchmark testing. After testing is done, benchmarks can be executed nondeterministically to achieve better performance.
Fine-grained transaction scheduling
Hugo Rito and João Cachopo
Software Transactional Memory (STM) is one promising abstraction to simplify the task of writing highly parallel applications. Yet, STMs are still impractical in high-contention scenarios where frequent transaction reexecutions hinder the progress, and, therefore, the performance of the system. In practice, as the number of threads grow, even for read-dominated workloads the interaction between concurrent transactions may result in frequent transaction reexecutions, further highlighting STMs' inability to take full advantage of the available processing power.
Notwithstanding the problem of conflicts being related to STMs' optimistic approach to concurrency, we argue that an optimistic solution capable of dealing with conflicts efficiently may have benefits over a conservative approach that restricts the concurrency among transactions too much.
Our belief is that the information collected during a transaction, such as the transaction's read-set and write-set, is crucial to improve STM's performance in conflict-intensive workloads as it allows transactions to learn from their past and adapt accordingly in the future. For instance, an STM able to remember which transactions usually conflict with each other may avoid starting two transactions that will conflict in the future.
Current scheduling solutions, however, are too conservative as they over-serialize transactions---that is, two transactions are scheduled to execute one after the other when they may overlap. Thus, we intend to take advantage of the unique characteristics of a multi-versioned STM to allow transactions that usually conflict with each other to execute concurrently without conflicting. The idea is to delay each starting transaction based on which other transactions are executing, so that during the starting transaction's lifetime it may access a memory position written by another as soon as the other commits.
To exemplify, consider two read-write transactions (T1 and T2) that in the past conflicted on memory position M. If T2 tries to start while T1 is executing, a serializing scheduler blocks T2 until T1 commits successfully, hindering parallelism. On the other hand, with our solution both transactions may execute concurrently without conflicting as our added delay to T2's start ensures that T1's commit happens before T2's first access to memory position M. This way, when T2 eventually tries to read from M, the STM runtime upgrades T2's version number to a version compatible with T1's committed value, preventing T2's restart and, thus, improving performance.
Exploiting program phases in an FPGA-based Hybrid Transactional Memory system
Philipp Kirchhofer, Martin Schindewolf, Wolfgang Karl and Nehir Sonmez
Our aim in this ongoing research project is to enhance an existing HybridTM system with adaptive features by allowing to switch on-the-fly on a fine-grained level between different TM conflict detection and data versioning policies. We also aim to provide hybrid transaction partitioning, where transactions, which are known to fail in Hardware Transactional Memory (HTM) mode, are automatically executed in Software Transactional Memory (STM) mode.
Current research shows the benefits of having adaptivity in Software Transactional Memory runtime environments. Compared to the current state of the art we enhance the scope by running on a HybridTM system, additionally accounting for changing Hardware Transactional Memory behaviour and STM/HTM interaction during different application phases.
The transactional behaviour of the application is gathered during runtime using a low overhead profiling framework covering both Software and Hardware Transactional Memory modes. The gathered information is used to detect changing application phases. By applying heuristics we then decide dynamically which TM strategy and corresponding parameters are the best match for the current application phase in the context of the given HybridTM system.
Extensions to the Architecture and Microarchitecture of HTM
Introduction: Hardware transactional memory (HTM) has been proposed more than a decade ago and is now finding its way into commodity microprocessors.
We are currently investigating a number of extensions to these simplistic hardware designs and carefully craft them small to aid integration in existing HTM implementations.
Decomposing Lock Elision: Speculative lock elision (SLE) essentially converts critical sections protected by a lock into transactions executing in parallel.
For example, a simple test-and-set lock is changed as follows:
reg <- taken
ACQUIRE exchange(reg <-> *lock)
if (reg != free) goto spin_and_retry
RELEASE store(free -> *lock)
With the ACQUIRE / RELEASE primitives, the processor converts the critical section into a transaction. Furthermore, accesses to the lock change: instead of acquiring the memory for writing, only a local modification is made; and the processor tracks the value history of the lock and effectively compresses this value history cycle into a single read of the free value.
The net effect of these changes is that multiple concurrent critical sections can locally "acquire" the lock concurrently and do not abort each other due to write-after-write conflicts.
Unfortunately, these mechanisms are only exposed as a package. We propose to make each of these components available as explicit instructions, following the RISC principle. The separate availability allows programmers to write their own elision mechanisms more effectively, and can be applied in other transactional algorithms.
We can treat local stores as lazily versioned stores, they keep the transactions linearisable, but do not eagerly send out write probes which could abort other transactions. We propose to add a "lazy" prefix to allow the programmer to mark their stores, making this an opt-in feature, since lazy versioning may require additional hardware and thus may be constrained in capacity. Also, eager versioning may have beneficial effects on system performance.
Secondly, tracking the value history can be used to essentially squash a sequence of writes (e.g. store(A -> mem1), store(B -> mem1), store(C -> mem1) inside a single transaction to a single global store(C -> mem1)).
If that remaining store restores to the value read initially in the transaction, it can be dropped, too.
We propose to mark the initial load with a "cycle" prefix to enable the HW tracking of the value history of subsequent stores.
Session 3: Language Integration and Tools
Back to the Futures: the Futures-based Distributed Transactional Memory
Paweł T. Wojciechowski and Jan Kończak
Maintaining strong consistency in a large distributed system is expensive and difficult. The difficulty is explained by the CAP theorem which states that of three properties of such systems: consistency, availability, and partition tolerance, only two can be achieved simultaneously.
In our ongoing work, we propose distributed transactional memory (TM) which circumvents the CAP theorem by oﬀering stronger guarantees than those typically achieved by eventually consistent systems. Our goal is to allow transactions to access shared data simultaneously while maintaining strong consistency lazily. The key idea of our design stems from the fact that data modiﬁed or observed by transactions are often not used by the user (or application) immediately. As long as the real value of modiﬁed data is not required by the application that employs the transactional system, a symbolic value (or expression) can be stored and used within a transaction. The user is unaware whether the transaction is operating on a real value or on a symbolic value. However, the system makes sure that the user (or application) can only observe the results of transaction processing that satisfy required properties, i.e. atomicity, consistency, isolation, and (optionally) durability.
Consider the following example: A long-running transaction T1 operates on a counter. In the meantime, two concurrent short transactions T2 and T3 appear in the system in the given order. T2 just resets the counter. T3 increments and reads the counter. To guarantee opacity, T2 and T3 are typically rolled back or blocked waiting for T1’s commit (depending on the concurrency control scheme). Note that eventually T2 will commit and set the counter to 0 and then T3 will increment the value from 0 to 1 and ﬁnally display it. In our approach, T2 and T3 can proceed undisturbed and commit before T1. Notably, in some weakly consistent models the result of T3 can be unpredictable. In some systems, this example could even produce a result never occurring in any sequential history of transactions.
We designed a novel TM algorithm, in which the above problem is solved by reusing the concept of futures. To increase concurrency, transactions use futures to access shared data that may still be in use by some older transactions. Instead of performing transaction operations on concrete values, they are executed on futures, resembling lazy binding known from functional programming languages. However, as soon as the real value can be computed, the future expression is evaluated in order to produce the concrete value. In case a concrete value must be known (or exported) before a transaction commits, the transaction must wait for any values necessary to compute the expression (this is unavoidable unless we relax strong consistency). The main advantage of our approach is that transactions never run into conﬂicts. Thus, TM can guarantee opacity and strong consistency while, at the same time, reducing the dependencies between concurrent transactions.
Scalable self-tuning data placement for distributed transactional memory systems
Joao Paiva, Pedro Ruivo, Paolo Romano and Luis Rodrigues
This work addresses the problem of data placement in distributed transactional memory (DTM) systems. The goal is to place data in a way that leverages locality patterns in application's accesses, such that inter-node communication is minimized during transaction execution.
Classic approaches formulate the data placement problem as a constraint optimization problem, and use Integer Linear Programming techniques to identify the optimal placement strategy with the granularity of single objects. Unfortunately, this approach is non-scalable due to the large number of objects stored by a large scale DTM. Furthermore, it is challenging to maintain efficiently a (potentially extremely large) directory to store the mapping among objects and nodes.
Existing DTM systems tackle this issue by adopting two main strategies: random placement or dedicated directory services. Random placement solutions rely on simple, random hash functions to scatter data across nodes. These solutions allow lookups to be performed locally, in an efficient manner. However, due to the random nature of data placement, they may result in highly sub-optimal data placements. Systems relying on dedicated directory services achieve more flexibility in the mapping between objects and nodes. However, for scalability reasons most practical systems opt for coarse data placement strategies, rather than per-object relocation strategies. Furthermore, such solutions introduce additional round-trip delays along the critical execution path of data access operations, potentially hindering performance.
In this work we introduced AutoPlacer, a system that aims to improve data locality in a DTM system by moving objects to the nodes that most frequently access them, while keeping account their capacity constraints. To improve scalability, AutoPlacer relies on a distributed algorithm that operates in rounds, in each round optimizing the placement of a configurable number of objects that generate most remote accesses. We integrated AutoPlacer in Infinispan, a popular, open-source DTM and evaluated it using standard benchmarks. Results show that AutoPlacer can improve throughput up to a 6x factor when compared with an implementation using consistent hashing.
To achieve low bandwidth and storage costs for keeping the data placement mapping up to date, the system also makes use of an innovative data structure, named Probabilistic Associative Array (or PAA for short). PAAs expose the same interface of conventional associative arrays, but, in order to achieve space efficiency, they use probabilistic techniques and trade-off accuracy. This data structure allows our design to be more scalable than directory-based systems without sacrificing fine-grained placement for the most relevant objects.
Concurrent Message Processing using Transactional Memory in the Actor Model
Pascal Felber, Derin Harmanci, Yaroslav Hayduk and Anita Sobe
Parallelism can be supported by the means of message passing and/or shared memory depending on the type of the underlying architecture: distributed or multi-core systems. To simplify the developer's task, it is desirable to provide programming abstractions that support both trends and allow programs to be deployed on many multi-core machines. By seamlessly expanding beyond the single node boundaries, one can develop highly scalable concurrent programs.
The "actor" model has been successfully used for scalable computing in distributed systems where each machine has a single core. Actors perform local decisions based on incoming messages, i.e., the state of an actor can only be modified by the exchange of messages. One of the fundamental principles of actor models is to guarantee sequential message processing, which avoids typical concurrency hazards. A potential problem of this approach is that it limits the achievable throughput, and therefore the scalability. The semantics of the actor model, however, are necessary for program correctness. In this context, we propose to add support for speculative concurrent execution in actors, using transactional memory (TM) to protect the concurrent processing of messages.
Consider for example an actor that maintains some shared state. Other actors interact with the first one to read or manipulate the state. With sequential message processing, access to that state will be suboptimal when operations do not conflict (e.g., modifications to disjoint parts of the state, multiple read operations). The concurrent access to the shared state can be guaranteed safely in many cases, while in conflicting situations transactions provide support for rollback and restart.
We further see the potential for adaptability, i.e., if concurrent processing of messages is not possible due to dependencies, we can easily choose to fall back to a sequential execution. This flexibility allows the approach to adapt itself to varying workloads.
We provide a simple solution, using the Scala programming language and the Akka framework, that introduces only few syntactic changes for the programmer. Additionally, our approach inherently provides a simple error recovery mechanism. We investigate the overhead introduced by our approach and show that it is hidden by the improved throughput, and thus leads to an overall performance gain. We further identify the cases when the program should switch from speculative processing to sequential processing.
Post-Silicon Debugging of Transactional Memory Tests
Ophir Friedler, Wisam Kadry, Amir Nahir, Vitali Sokhin, Carla Ferreira and João Lourenço
A goal of the post-silicon validation effort is to detect and analyze the root cause of design functional bugs, which escaped the pre-silicon verification effort. The post-silicon validation process of a CPU comprises four interleaved elements: stimulating the design under test (DUT), detecting an erroneous behavior, determining the root cause of the problem, and providing a fix.
An exerciser that performs multi-pass consistency checking is a program that runs on the DUT and repeatedly generates and executes test-cases . The exerciser executes each test-case multiple times and compares the final results of those runs for consistency. When an inconsistency between the passes is detected, named mis-compare, the exerciser halts its execution and reports the list of inconsistent resources, that is, memory and CPU registers values.
Our goal is to develop a methodology to automate post-silicon debugging of transactional memory tests fails in the DUT by narrowing down the list of suspicious instructions in test-cases that result in a mis-compare. The methodology combines the use of a hardware exerciser to generate, execute, and check a vast number of test-cases. We propose to use the software reference model (also named golden model) as a vehicle to explore fail reasons and obtain observability into the architectural changes triggered by the test-case.
Based on a failure report extracted from hardware, we are capable of fast forwarding the reference model to the failing test. We execute the failed test instruction by instruction to build a resource dependency graph. Based on this graph, we utilize dynamic slicing techniques to determine the list of instructions that affect the mis-comparing resources at the end of the test. In addition, a graph-colouring based technique enables to reduce this instruction list to a smaller set of instructions that have high likelihood of triggering the mis-compare. Based on the information gathered in this process, such as the identity of the suspicious instruction, its location in memory and the operands it accesses, one can configure the hardware debug logic to trace relevant data, facilitating the precise identification of the root-cause of the bug and its fix.
 Allon Adir, Maxim Golubev, Shimon Landa, Amir Nahir, Gil Shurek, Vitali Sokhin, and Avi Ziv. 2011. Threadmill: a post-silicon exerciser for multi-threaded processors. In Proceedings of the 48th Design Automation Conference (DAC '11). ACM, New York, NY, USA, 860-865. DOI=10.1145/2024724.2024916 http://doi.acm.org/10.1145/2024724.2024916
Efficient Support of Reduction Operations in Transactional Memory
Miguel Angel Gonzalez-Mesa, Eladio Gutiérrez, Sonia Gonzalez and Oscar Plata
Today's shared-memory commodity computers, composed of multicore processors, require big efforts of programmers to develop parallel software in order to exploit efficiently the computing resources. In this context, reduction operations appear frequently in the core of many applications. In many cases, they are responsible of an important part of the overall computing time. Due to their importance, it is usual that parallel programming languages, like the OpenMP API, include specific syntax to efficiently parallelize this type of operations.
Modern compiler technology does a great job generating code from parallel reductions, using a combination of privatization techniques, atomic operations and lock and lock-free algorithms. However, current shared-memory multiprocessors, based on several multicore processors with complex, partially shared, cache hierarchies, make difficult to decide which machine code implementation aternative is best to be used in a particular application reduction loop.
In this work we propose to use transactions to implement parallel reductions. In principle, transactions may replace critical sections to protect the shared accesses to the reduction variable. However, if the contention in such variable is high, concurrent transactions conflict frequently and performance degrades. In order to solve this problem, we propose specific support for reductions in the TM system, basically introducing small modifications in the commit phase and in the data versioning mechanism. With this design, we observe a drastic reduction of transaction aborts as all conflicts due to the reductions are filtered out. Besides, our reduction-aware TM (R-TM) system allows programmers to introduce a new design dimension, the size of transactions (in terms of blocks of iterations of the reduction loop), that permits a finer control over the tradeoff between memory privatization and frequency of thread synchronization.
We compare the implementation of our proposal with that of a vanilla TM system and that of typical OpenMP compilers. The implementation of our design in a lightweight software TM system, TinySTM, and its evaluation on a set of benchmarks, proves the benefits of our proposal.
Session 4: Applications and Performance Evaluation
Using TM for high-performance Discrete-Event Simulation on multi-core architectures
I recently started to investigate how TM could possibly be used for optimizing the performance of a discrete-event simulation (DES) engine on a multi-core architecture. A DES engine needs to process events in chronological order. For this purpose, it needs an efficient data structure, typically abstracted as a heap or priority queue. Therefore, my goal is to design an optimized heap-like data structure supporting concurrent multi-thread access patterns, such that multiple events can be processed in parallel by multiple threads. In DES, traditional parallelization techniques fall in two categories: either conservative, or optimistic. In the conservative approach, events are dequeued and processed in strict chronological order, which requires a synchronization protocol between the concurrent logical processes (LPs), to ensure consistency. In the optimistic approach, LPs are free to proceed and possibly violate the chronological order, but in case such a violation happens, a roll-back mechanism is used to return to the last consistent state (which requires a snapshot). The solution I am currently investigating is based on a Software Emulation Library for C++ called TBoot.STM. This library offers various transaction semantics, among which one, called invalidate-on-commit, that allows a transaction to be invalidated by the process that "suffers" the violation rather than the one that originates it. In our case, assuming that a transaction is associated to the dequeuing and processing of an event, a transaction is deemed successful when it completes without any prior event to be inserted in the heap an no earlier event is still pending. This is where building a solution based on invalidate-on-commit and transaction composition seems promising: Indeed, it seems easier to discover chronological violation when new events are inserted. In that case, all transactions that where mistakenly started too early can be invalidated. This library also provides a way for composing transactions, which could also prove to be helpful. For example, an agressively optimistic strategy could dequeue new events before the full completion of earlier events in which case the composition could be used to make the completion of later events depend on the completion of earlier ones. I am still at an early stage of this work, for which I just started experiments and performance evaluations.
STurbo: Speeding up Conflicting Transactions
Jons-Tobias Wamhoff, Christof Fetzer and Pascal Felber
The most efficient software transactional memory (STM) implementations are based on versioned locks. In high contention workloads, locks that are already owned prevent other threads from making progress if they conflict on the same lock. Different contention management strategies have been proposed, but they have in common that the thread that looses the conflict must abort and retry the transaction. Without waiting for the winning thread to finish its transaction, the loosing thread has a high chance to run into the same conflict. In the worst case, this can result in a livelock.
Current processors can run above their base operating frequency within the boundaries of the thermal design power (TDP). Examples are Intel Turbo Boost and AMD Turbo Core. The original design of the manufacturers aims to keep the existence of the turbo mode transparent for the application. So called p-states are used to adjust the frequency, while each available p-state represents a specific frequency. If a sufficiently large number of cores is set to a low frequency because they are waiting, the TDP allows selected cores to run at a boosted (maximum) frequency. We propose a library that allows to explicitly adjust the frequency for individual cores using the p-state interface.
The library can be combined with existing STM implementations that support explicit contention managers in order to enable to wait for conflicting transactions to finish. The frequency for cores that execute waiting threads is reduced in order to save TDP. The thread that won the conflict can then run on a core with the maximum possible frequency. This way, the total time to wait for the completion of conflicting transactions can be reduced and the overall throughout is improved.
Our approach introduces only little additional overhead for adjusting the p-states. While it is ideal for high contention workloads, we can also enable the turbo to resolve peak loads and switch back to normal operation at the base operating frequency afterwards. During the wait time, threads are still able to operate at a lower speed, e.g., to perform some cleanup work.
Self-Tuning Replication in Distributed Transactional Memories
Maria Couceiro, Pedro Ruivo, Paolo Romano and Luis Rodrigues
Replication plays an essential role for distributed software transactional memory (DSTM) systems, given that it represents the primary mean to ensure data durability. Unfortunately, no single replication technique can ensure optimal performance across a wide range of workloads and system configurations. This presentation tackles this problem by introducing MORPHR, a framework that allows to automatically adapt the replication protocol of DSTMs according to the current operational conditions.
We present the results of a performance evaluation study using Infinispan, a popular, open-source DSTM, which we extended to support three different replication protocols: primary-backup, distributed locking based on two-phase commit, and atomic broadcast based certification. The results of our study highlight that none of these protocols can ensure optimal performance for all possible configurations, providing a strong argument to pursue the design of abstractions and mechanisms supporting the online reconfiguration of replication protocols.
We also introduce MORPHR, formalising a set of interfaces with precisely defined semantics which need to be exposed by an arbitrary replication protocol in order to support its online reconfiguration, i.e. switching to a different protocol. The proposed framework is designed to ensure both generality, by means of a protocol-agnostic generic reconfiguration protocol, and efficiency, whenever the cost of the transition between two specific replication protocols can be minimized by taking into account their intrinsic characteristics.
We validate the MORPHR framework, by integrating it in Infinispan, which allows the assessment of its practicality and efficiency in realistic DSTMs. A noteworthy result highlighted by our experiments is that the MORPHR-based version of Infinispan does not incur in perceivable performance overheads in absence of reconfigurations (which is expected to be the most frequent case), with respect to the non-reconfigurable version. We use this prototype to evaluate the latencies of generic and specialised reconfiguration techniques, demonstrating that the switching can be completed with a latency in the order of a few tens of milliseconds in a cluster of 10 nodes employing commodity-hardware.
And finally, we show how to model the problem of determining the optimal replication protocol given the current operational conditions as a classification problem. By means of an experimental study based on heterogeneous workloads and platform scales, we demonstrate that this learning problem can be solved with high accuracy.
Boosting Throughput in Replicated Software Transactional Memories
Tiago Vale, Ricardo Dias and João Lourenço
Software Transactional Memory (STM), an attractive solution to support concurrent accesses to in-memory storage, has already been embraced by some of the major CPU and compiler manufacturers. Enterprise-class STM-based applications have already been deployed in production systems . Typical enterprise-class applications are web-based and are deployed according to a producer-consumer architecture, where a front-end HTTP server forwards incoming requests to back-end application servers for processing. These real-world applications are usually read-intensive, and hold requirements such as scalability and reliability which are commonly tackled using replication.
Given the similarities between STM and database transactions, research on STM replication have borrowed inspiration from the literature in replicated databases. A handful of replication protocols [2, 3, 4, 5], commonly referred to as certification-based, have been proposed and evaluated in the context of replicated STM systems. Certification-based protocols execute transactions optimistically in a single replica. Replica consistency is ensured at commit time using a total-order broadcast to guarantee a common transaction serialisation order.
Unfortunately, the relative overhead introduced by the certification of distributed memory transactions is much higher than the overhead introduced by the certification of distributed database transactions. On the one hand memory transactions operate over main memory which is a very fast storage media, and on the other hand some key features of databases require additional processing time, such as SQL parsing, plan optimisation and secondary storage accesses. As observed in  and confirmed by our own experiments, the time taken to execute a transaction is one order of magnitude smaller than the time taken to certify that same transactions (μs vs. ms). This discrepancy leads to a severe under-utilisation of CPU time and other computational resources.
Reducing the transaction certification time has been the goal of previous works in certification-based STM replication. In this work we depart from this goal. Instead, we exploit the inevitable overhead imposed by transaction certification to boost the throughput of enterprise-class web applications.
The key idea of our technique stems from three fundamental ingredients of certification-based protocols for STM replication, namely that (1) transaction certification incurs in costly inter-replica coordination; (2) execution transactions is typically one order of magnitude smaller than certifying; and (3) read-only transaction are spared from certification. We leverage certification latency (the time between the broadcast and delivery of a transaction) by using the processing time previously wasted to execute pending read-only transactions, thus providing a free-of-charge throughput boost.
1. Instituto Superior Técnico: Fénix Edu. https://fenix-ashes.ist.utl.pt (2013)
2. Couceiro, M., Romano, P., Carvalho, N., Rodrigues, L.: D²STM: Dependable Distributed Software Transactional Memory. In: Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing (PRDC). (2009) 307–313
3. Carvalho, N., Romano, P., Rodrigues, L.: Asynchronous Lease-Based Replication of Software Transactional Memory. In: Proc. of the ACM/IFIP/USENIX International Conference on Middleware (Middleware). (2010) 376–396
4. Carvalho, N., Romano, P., Rodrigues, L.: SCert: Speculative Certification in Replicated Software Transactional Memories. In: Proc. of the Annual International Systems and Storage Conference (SYSTOR). (2011)
5. Couceiro, M., Romano, P., Rodrigues, L.: PolyCert: Polymorphic Self-Optimizing Replication for In-Memory Transactional Grids. In: Proc. of the ACM/IFIP/USENIX International Conference on Middleware. (2011) 309–328
6. Romano, P., Carvalho, N., Rodrigues, L.: Towards Distributed Software Transactional Memory Systems. In: Proc. of the Workshop on Large-Scale Distributed Systems and Middleware (LADIS). (2008)
Combining on-line exploration and model driven performance forecasting to self-tune the parallelism level in Distributed Transactional Memory systems
Diego Didona, Pascal Felber, Derin Harmanci, Paolo Romano and Joerg Schenker
In this talk we present a novel methodology combining an exploration based approach with an analytical-model driven one in order to determine, at runtime, the optimal level of parallelism of Distributed Transactional Memory (DTM) applications, i.e., the number of nodes in the DTM platform and the number of active threads in each node which maximize the application's throughput.
The appealing characteristic of on-line exploration based approaches lies in its robustness, as the decision on the levels of parallelism to explore is taken by directly gathering the feedback from the application actually in execution. However, this kind of approaches would result in a slow convergence speed towards the optimal solution: in a DTM, in fact, the level of parallelism is given both by the number of nodes composing the platform and the number of active threads per node, resulting in a two-dimensional solution space.
On the other hand, a model-based approach is able to drive the search for the optimal multiprogramming level in a DTM without incurring in the overhead of exploring different configurations of number of nodes, i.e., avoiding the costly state transfers that adding/removing a node to/from the platform implies. However, the accuracy of analytical models can be impaired in scenarios in which the assumptions and approximations of the models are challenged.
The hybrid approach we propose results into the implementation of a controller which identifies the optimal level of parallelism for a DTM in the following way. Given a configuration in terms of number of nodes, it relies on local, inexpensive exploration of alternative multiprogramming levels in terms of active threads on each node. Such exploration is aimed at assessing the accuracy of the model while varying the concurrency level of the application, without triggering any state transfer. The feedback gathered comparing the prediction of the model with the actual performance delivered by the application allows the controller to train on-line a machine learner which is able to infer a corrective factor (which is a function of the concurrency level and of the workload) applied to the output of the model in order to improve its accuracy.
We assess the validity of the proposed approach via a simulation-based study based on real traces taken from the deployment of the TPC-C benchmark over Infinispan, a distributed transactional platform.