Session 1: Theoretical Foundations and Algorithms

Title: Verifying Transactional Programs with Programmer-Defined Conflict Detection
Authors: Omer Subasi, Tayfun Elmas, Adrian Cristal, Tim Harris, Serdar Tasiran, Ruben Titos-Gil, and Osman Unsal
Abstract: Transactional Memory (TM) simplifies concurrent programming by providing atomic, compositional blocks within which programmers can reason sequentially. In some programs, TM performance suffers because transactions conflict frequently. In response, TM implementers have developed mechanisms for programmer-directed relaxed conflict detection. The programmer indicates to the TM to ignore certain conflicts, the frequency of conflict is reduced and performance improves. However, the simplicity of sequential reasoning is lost. In this work, we present a method for modeling and verifying programs using TM with relaxed conflict detection.
In this work, we present a method for abstracting in a quantified manner the accesses on which conflicts are ignored. This abstraction and a soundness theorem we state and prove, allow the programmer to reason sequentially on an abstracted version of the transaction. Properties proved on this abstract but sequential version of the transaction carry over to the original program. We performed the proofs in two steps. First, we assumed a TM that does not ignore any conflicts and did a standard, contract-based sequential proof. (Here, contracts include invariants, pre- and post-conditions and loop invariants.) For this, we used VCC and HAVOC, state-of-the-art modular verification tools for C programs. Then, we applied the abstraction on the program annotated with contracts and applied the same tools. We found that the sequential verification of these benchmarks after the abstraction did not require extra or stronger contracts compared to the original sequential proof. This indicates that the arguments about why the transactions work under relaxed detection conflicts originates from the arguments about why these transactions work sequentially.

Title: FastLane: Streamlining Transactions for Low Thread Count
Authors: Jons-Tobias Wamhoff, Christof Fetzer, Pascal Felber, Etienne Riviere and Gilles Muller
Abstract: Software transactional memory (STM) can lead to scalable implementations of concurrent programs, as the relative performance of an application increases with the number of threads that support it. However, the absolute performance is typically impaired by the overheads of transaction management and instrumented accesses to shared memory. This often leads a STM-based program with a low thread count to perform worse than a sequential, non-instrumented version of the same application.
We propose FastLane, a new STM system that bridges the performance gap between sequential execution and classical STM algorithms when running on few cores. FastLane seeks to reduce instrumentation costs and thus performance degradation in its target operation range. We introduce a family of algorithms that differentiate between two types of threads: One thread (the master) is allowed to commit transactions without aborting, thus with minimal instrumentation and management costs, while other threads (the helpers) can commit transactions only when they do not conflict with the master. Helpers thus contribute to the application progress without impairing on the performance of the master.
FastLane is implemented within a state-of-the-art STM runtime and compiler. Multiple code paths are produced for execution on a single, few, and many cores. Applications can dynamically select a variant at runtime, depending on the number of cores available for execution. Preliminary evaluation results indicate that our approach provides promising performance at low thread counts: FastLane almost systematically wins over a classical STM in the 2-4 threads range, and often performs better than sequential execution of the non-instrumented version of the same application.
We will discuss how the performance can be improved by applying various optimizations. One interesting optimization and tuning aspect is the interference between master and helper threads. While the master thread should not degrade in throughput compared to the sequential baseline when contention is increased, the helper threads should have a chance to commit their share of the workload for better scalability. First results show that a good tradeoff can be achieved that enables FastLane to compete in scalability with low thread counts even in high contention workloads.

Title: Reusable Concurrent Data Types
Authors: Vincent Gramoli and Rachid Guerraoui
Abstract: Abstract data types (ADTs) have shown to be instrumental in making sequential programs reusable.  ADTs promote extensibility as one ADT can be specialized through inheritance by overloading or adding new methods, and composition as two ADTs can be combined into another ADT whose methods invoke the original ones.  Unfortunately, most ADTs that export concurrent methods, often called Concurrent Data Types (CDTs), are not reusable:  the programmer can hardly build upon them.
In fact, CDTs are typically optimized for exporting a fixed set of methods that are guaranteed to be "atomic", letting the programmer reason simply in terms of sequential accesses. Atomicity is, however, no longer preserved upon introduction of a new method that is susceptible of executing concurrently with existing ones. Often, the resulting semantics is far more complex than the sequential specification, making it diffcult to write correct concurrent programs. Two years ago, we reported a bug in the Java ConcurrentLinkedQueue CDT to the JSR166 group, namely the non-documented atomicity violation of its size method. In short, one cannot extend the Michael and Scottís queue [7] with a size method without modifying the original methods, a problem related to the inheritance anomaly [6]. Another bug, reported in [9, 1], outlines the difficulty of ensuring atomicity of a composite CDT: the Vector(Collection) constructor of the JDK 1.4.2 raises an ArrayIndexOutOfBoundsException.
In theory, there are solutions to address these limitations.  Universal constructions [4], CASN [2], transactional memory [5, 8] could play the role of method wrappers that preserve the atomicity of series of read/write accesses under composition [3]. In addition, a programmer invoking every method of an ADT through one of these wrappers guarantees the atomicity of these methods in any possible concurrent execution. Unfortunately, any of these wrappers must accommodate the method with the strongest requirement being thus overlying conservative for all other methods and limiting their concurrency.  For example, the wrapper should read all elements that are present in a sorted linked list at a common point of the execution when executing a size method, yet this requirement is unnecessary when executing a contains method.
We present the Polymorphic Transaction (PT) methodology, a methodology to construct reusable CDTs.  In short, PT exploits a pre-existing polymorphic set of transactional wrappers (compatible with each other) that preserve the atomicity and concurrency of methods to simplify concurrent programming.
[1]  Cormac Flanagan, Stephen N. Freund, Marina Lifshin, and Shaz Qadeer. Types for atomicity:  Static checking and inference for Java.  ACM Trans. Program. Lang. Syst., 30(4), 2008.
[2]  Tim Harris, Keir Fraser, and Ian A. Pratt. A practical multi-word compare-and-swap operation. In DISC, 2002.
[3]  Tim Harris,  Simon Marlow,  Simon Peyton-Jones,  and Maurice Herlihy. Composable memory transactions. In PPoPP, pages 48ñ60, 2005.
[4]  Maurice Herlihy. A methodology for implementing highly concurrent data objects. ACM Trans. Program. Lang. Syst., 15(5), 1993.
[5]  Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. SIGARCHComput.Archit.News, 21(2), 1993.
[6]  Satoshi Matsuoka and Akinori Yonezawa. Analysis of inheritance anomaly in object-oriented concurrent programming languages.  In Research directions in concurrent object-oriented programming, pages 107-150. MIT Press, Cambridge, MA, USA, 1993.
[7]  Maged M. Michael and Michael L. Scott.  Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In PODC, 1996.
[8]  Nir Shavit and Dan Touitou.  Software transactional memory.  In PODC, 1995.
[9]  Liqiang Wang and Scott D. Stoller. Runtime analysis of atomicity for multithreaded programs. IEEE Trans. Softw. Eng., 32(2), 2006.

Title: OSARE: Opportunistic Speculation in  Actively REplicated Transactional Systems
Authors: Roberto Palmieri, Francesco Quaglia and Paolo Romano
Abstract: Active  replication  is  a  classical  means  for  providing  fault-tolerance  and  high availability. It relies on consensus among the replicas on a common total order for the processing of incoming requests, typically encapsulated by a so called Atomic Broadcast (AB) group communication primitive. In this presentation our focus is on active replication in the context of transaction processing systems, in which the group communication primitive used is an improvement of AB, that is Optimistic Atomic Broadcast (OAB). The OAB primitive provides that a notification of the final  message  delivery  order  is  preceded  by  an  optimistic  message  delivery indication.  By  activating  transactions  processing  upon  their  optimistic  delivery, rather  than  waiting  for  the  final  order  to  be  established,  OAB-based  replication techniques  overlap  the  (otherwise  sequential)  replica  synchronization  and  local computation phases. However, serializing transactions according to the optimistic delivery  order  does  not  pay-off  in  case  of  non-minimal  likelihood  of  mismatch between  optimistic  and  final  message  ordering.  In  such  a  case,  optimistically processed  transactions  may  have  to  be  aborted  and  restarted  right  after  OAB completion,  thus  nullifying  any  performance  gain  associated  with  their  early activation. In order to cope with the above issue, in this talk we present a novel active  replication  protocol  for  transactional  systems  based  on  an  opportunistic paradigm,  which  we  name  OSARE   -  Opportunistic  Speculation  in  Active REplication.  OSARE  maximizes  the  overlap  between  replica  coordination  and transaction  execution  phases  by  propagating,  in  a  speculative  fashion,  the (uncommitted) post-images of completely processed, but not yet finally delivered, transactions along chains of conflicting transactions. Also, speculation attempts to serialize  any  transaction  after  those  that  have  preceded  it  within  the  optimistic delivery order. In case the miss of some write is experienced along the execution path, which we refer to as a snapshot-miss, the materialized serialization order is opportunistically kept alive, with the aim of increasing the likelihood of matching the final order established by the OAB service in case it reveals not aligned with the optimistic delivery sequence. Further, if a transaction T experiences a snapshot miss,   OSARE   re-activates   a   new   instance   of   T   (and,   recursively,   of   the transactions having developed a read-from dependency from T), thus biasing the speculative exploration towards a serialization order compliant with the optimistic message  delivery  order.  Interestingly,  the  likelihood  of  snapshot-miss  events  is higher  in  high  concurrency  scenarios,  namely  when  the  interarrival  time  of optimistic deliveries is relatively short compared to transaction processing latency.
These  scenarios  are  precisely  those  in  which  the  probability  of  mismatches between  the  optimistic  and  final  message  delivery  orders,  and  consequently  the added  value  of  exploring  additional  speculative  serialization  orders,  are  higher. The ability of OSARE to adjust adaptively its degree of speculation on the basis of  the current level of concurrency represents a unique, innovative feature.

Title: A Multiversion Update-Serializable Protocol for Genuine Partial Data Replication
Authors: Sebastiano Peluso, Pedro Ruivo, Paolo Romano, Francesco Quaglia and Luis Rodrigues
Abstract: This talk addresses the problem of how to ensure strong consistency in distributed transactional memory systems in a scalable and efficient way. This is an extremely relevant topic of late, as the  advent  of  cloud  computing  has  led  to  the  proliferation  of  a  new  generation  of  in-memory transactional data platforms (sometimes referred to as NoSQL data grids [1], [2]), which rely on replication rather than persistence as the primary mechanism to achieve data durability. In these distributed  transactional  memory  platforms,  data  replication  plays  a  fundamental  role  for  both performance and fault-tolerance purposes, since it allows enhancing the throughput by distributing the load and ensuring data survival despite the occurrence of failures.
In  this  talk  we  present  GMU  (i.e.  Genuine  Multiversion  Update  serializable  replication),  a novel partial replication protocol for transactional systems designed to maximize scalability while ensuring strong consistency guarantees.
The core of GMU is a distributed multiversion concurrency control algorithm, which relies on a novel vector clock [3] based synchronization mechanism to track, in a totally decentralized (and consequently scalable) way, both data and causal dependency relations among transactions. GMU achieves high scalability by ensuring the, so called, genuineness property, i.e. by guaranteeing that only the replicas maintaining the data items accessed by a transaction are contacted in order to determine its final outcome (commit/abort). Further, performance are maximized by ensuring that read-only transactions are never aborted or forced to undergo any additional remote validation phase.
From the consistency perspective, GMU (as its name suggests) ensures the so called Update Serializability  (US)  criterion,  originally  introduced  in  [4]  and  further  investigated  in  [5].  US provides  a  semantic  equivalent  to  classic  1-Copy  Serializability  (1CS)  for  update transactions. Analogously to 1CS, with US read-only transactions are also guaranteed to observe a serializable snapshot,  namely  a  snapshot  equivalent  to  some  serial  execution  (formally,  a  linear  extension [3])  of  the  (partially  ordered)  history  of  update  transactions.  However,  unlike  1CS,  US  allow concurrent read-only transactions to observe snapshots generated from different, but compatible, linear  extensions  of  the  history  restricted  to  update  transactions.  As  a  consequence,  the  only discrepancies in the serialization orders observable by read-only transactions are imputable to the re-ordering of update transactions that neither conflict (directly or transitively) on data, nor are causally dependent. In other words, the only discrepancies perceivable by end-users are associated with the ordering of logically independent concurrent events, which has typically no impact on the correctness of a wide range of real-world applications [5].
The GMU protocol was integrated in a popular open source in-memory distributed transactional memory platforms for cloud environments, Infinispan [1] by Red-Hat/JBoss. On the basis of a large scale experimental study performed on heterogeneous experimental platforms and using industry standard benchmarks (namely TPC-C [6] and YCSB [7]), we  show that GMU achieves linear scalability and that it introduces negligible overheads (less than 10%) with respect to solutions ensuring non-serializable semantics in a wide range of workloads.
[1]  Red-Hat/JBoss, Infinispan. http://www.jboss.org/in?nispan.
[2]  Oracle Inc., Oracle Coherence Online Documentation Library Release 3.7.1, 2012.
[3]  L. Lamport, Time, clocks, and the ordering of events in a distributed system, Commun. ACM, vol. 21, pp. 558-565, July 1978.
[4]  R. Hansdah and L. Patnaik, Update serializability in locking, vol. 243 of Lecture Notes in Computer Science, pp. 171-185, Springer Berlin / Heidelberg, 1986.
[5]  A.  Adya,  Weak  consistency:  A  generalized  theory  and  optimistic  implementations  for  distributed transactions, tech. rep., PhD Thesis, MASSACHUSETTS INSTITUTE OF TECHNOLOGY, 1999.
[6]  TPC Council, TPC-C Benchmark. http://www.tpc.org/tpcc.
[7]  B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, Benchmarking cloud serving systems with ycsb, in Proceedings of the 1st ACM symposium on Cloud computing, SoCC 10, 2010.

Session 2: Hardware Transactional Memory and Scheduling

Title: Enhancing an HTM System with Monitoring, Visualization and Analysis Capabilities
Authors: Philipp Kirchhofer, Martin Schindewolf, Nehir Sonmez, Oriol Arcas, Osman S. Unsal, Adrian Cristal, and Wolfgang Karl
Abstract: Parallel programming is essential to obtain the full performance of current ubiquitous multi-core architectures.  Transactional Memory (TM) is a new parallel programming paradigm trying to be easily understandable by programmers, have good scalability and deliver high performance. The programmer, who uses current Hardware Transactional Memory (HTM)systems, is often unaware of an application's behavior. This makes optimizing the application a trial-and-error process. As a consequence the TM applications do not run as efficiently as possible. A solution is to generate event logs with software. This method captures and preserves the dependencies between transactions that occur during run time. This approach typically comes with runtime overhead and may influence the application runtime characteristics. To get a suitable overview of HTM behavior it is vital to have a monitoring infrastructure with no impact on application runtime characteristics and application behavior. The optimization hints gathered could otherwise be influenced in some way and cause a misguided optimization attempt.
Our goal is to develop a zero overhead monitoring and visualization infrastructure for an existing HTM system. This allows the programmer to get insights into the interaction between application and HTM system, to detect bottlenecks in the program flow and to optimize the application for the underlying HTM system. The key design aspects of the monitoring infrastructure include multi-core-scalability, high extensibility, zero run-time overhead and therefore no influence on application run-time characteristics. The underlying HTM implementation is based on the TMbox system developed at the Barcelona Supercomputing Center (BSC) [2]. The developed monitoring infrastructure consists of hardware units that are attached to each processor core. These units monitor state changes, generate corresponding events, and send events to a dedicated memory area of the host system. This saves the events for later post-processing and visualization.
Visualization and Analysis is done using the Paraver tool, also developed at the BSC [1]. Paraver is used to show a colored timeline of the flow of different HTM states, where each color corresponds to a certain HTM state. This allows the viewer to easily get an overview of TM activity and characteristics.  Conflicting transactions are specially marked.
The monitoring infrastructure uses about 3 % more registers, 2 % more LUTs and 9 % more Block-RAMs than a system without the infrastructure.  The monitored application is not affected in its run-time timing characteristics. The HTM application behavior is traced and stored for post-processing.  Post-processing delivers several analysis metrics that allow to compare different implementations of an application and to optimize their overall run-time and scalability on a given HTM system. The monitoring infrastructure developed allows for an in-depth monitoring, visualization and analysis of HTM applications running on the TMbox platform. These novel features enable the programmer to identify the specific characteristics of different parts of an application and to detect parts with sub-optimal behavior.
[1]  Paraver website. http://www.bsc.es/paraver.
[2]  Nehir Sonmez, Oriol Arcas, Otto P?ucker, Osman S. Unsal, Adrian Cristal, Ibrahim Hur, Satnam Singh, and Mateo Valero. TMbox: A flexible and reconfigurable 16-core hybrid transactional memory system. In Proc. FCCM 2011, pages 146-153.

Title: Towards Fair Transaction Scheduling with Competing QoS Requirements
Authors: Walther Maldonado Moreira, Pascal Felber, Gilles Muller, Julia Lawall and Etienne Riviere
Abstract: Transactional Memory (TM) is a concurrency control mechanism which follows the idea that "apologizing is better than asking for permission." That is, transactions are allowed to execute concurrently, and if there is a conflict (due to shared memory accessed simultaneously by multiple transactions) one or more transactions must restart execution, as many times as needed to make progress. However, this behaviour means that TM is unsuitable for time-bounded situations, as a transaction has no guarantee of how long it will take to complete or how many times it will need to restart.
Our previous work has shown that, by using different transaction executing mechanisms (each with different performance/commit-predictability tradeoffs) and recording elapsed time it is possible to commit a transaction of interest using a Quality of Service measure (i.e.: percentage of the transactions that must commit before the specified deadline), in which the QoS is met in most cases, even in cases with high contention.
Our current work deals with extending support towards multiple deadlines representing potentially competing QoS requirements. Dealing with multiple deadlines leads to several interesting challenges and design decisions. For one, it may not be possible to meet the deadline required under load, in which case the system must degrade overall performance in a "fair" way (without priviledging some transactions/threads over others). Additionally, ensuring commit of a transaction can cause additional delays to other threads, thus a greedy algorithm that does not consider the state of the other threads can lead to a situation where each thread is starving the others, heavily impacting performance.
Upcoming result is the development of a smarter scheduling algorithm, which can allow as many deadline-enabled transactions to commit on time, while still degrading gracefully on the presence of contention.

Title: Transaction Concurrency Control via Dynamic Scheduling Based on Static Analysis
Authors:  Pawel T. Wojciechowski, Konrad Siek
Abstract: We develop Atomic RMI-a concurrency control library built on top of Java RMI which ensures serializable access to distributed objects in Java by allowing the developer to define specific sequences of method calls on remote objects as distributed transactions. The system uses pessimistic concurrency control algorithms that suspend method calls when necessary to satisfy isolation (serializability). Recent work on software transactional memory (STM) demonstrated advantages of a fully pessimistic model. In our system, each non-faulty transaction is executed once and never aborts. Uncommitted transactions can be rolled back at any point, either explicitly (by the programmer) or implicitly (if a node crashes). Thus, the programmer is offered a convenient mechanism to express distributed atomic transactions in which shared data can be any objects declared as 'remote' (in the sense of Java RMI) and accessed by any of their methods.
We present a novel pessimistic concurrency control algorithm, which implements fine-grain locking that ensures transaction isolation and deadlock-freedom via dynamic scheduling of object calls. In order to schedule the execution of method calls efficiently, the algorithms require some input data about a transaction to be known a priori--i.e. before the transaction is spawned. More precisely, the developer needs to find the number of times a particular object will be accessed during run-time, or at least an upper bound on that number. If an object's method were called an unknown number of times due to loops or recursion, it is possible to mark the last use of the remote object by the transaction and to free it on that basis at run-time.
To ease programming and ensure safety, we developed an analysis algorithm for the prediction of upper bounds or the last use of objects by a transaction. The algorithm takes the form of interprocedural data flow and region-based analyses of an intermediate language. We employ this algorithm in a precompiler that instruments source code for use with Atomic RMI. We demonstrate that the upper bounds can also be used to achieve better throughput by scheduling whole transactions. We designed and implemented an extension of the concurrency control algorithm with transaction scheduling. Transactions are dynamically scheduled in terms of estimated length. We will show the results of experimental comparison of Java RMI with a global lock and Atomic RMI with and without transaction scheduling or specified upper bounds showing a performance increase in variations of the former in cases of high contention.

Title: Accelerating conflict-intensive applications
Authors: Hugo Rito and João Cachopo
Abstract: STMs collect large amounts of information about a transaction's execution, but they use this information only to validate the transaction and then discard it. A key observation, however, is that typically a piece of code of a program is executed many times, giving rise to several similar transactions.  For that reason, the information collected during the execution of a piece of code - a transaction - can be used to improve the performance of its future executions - other transactions created by the same piece of code.
In our work we address the problem of STMs' poor performance in conflict-intensive workloads by making effective use of all the information collected during transaction execution to implement, almost for free, optimizations that would otherwise be very expensive to implement outside the context of STMs.
Currently, we are exploring how memoization can be incorporated into an STM runtime to decrease the time spent on transaction reexecution. Memoization, in its original formulation, is a technique that improves the performance of repeated executions of read-only methods and, for that reason, is frequently used to improve the performance of programs in read-dominated workloads. Nonetheless, in the transactional context, memoization can be very useful in write-dominated workloads: Memoizing methods called inside a transaction may avoid repeated computations on transaction reexecution by using valid information stored in the memo-cache.
We propose to extend STM transactions with a per-transaction memo-cache that stores information about methods executed inside the transaction. This way, if the transaction subsequently restarts, the transaction's memo-cache may be used to accelerate the reexecution of all memoized methods with valid memo-cache entries. Because memo-cache operations may be very expensive, the idea is to execute each transaction in two threads: A main thread responsible for executing the application code, and a sibling thread responsible for searching the memo-cache and for collecting memo information.
Furthermore, information collected by one transaction may be used to accelerate future executions of the same transaction or even to improve the performance of other transactions, as long as transactions share common memoized methods and transactions incorporate valid memoization information in a central, system-wide memo-cache for the benefit of all running transactions.
Early results showed that large transactional applications can benefit a lot from memoization because they execute expensive operations repeatedly over the same data. As expected, we obtained the best results with read-dominated workloads, yet memoization showed promising results in write-dominated workloads as well.

Title: Digging parallelism out of high-conflicting workloads
Authors: Nuno Diegues and João Cachopo
Abstract: When an application's workload is mostly write-dominated, the use of optimistic concurrency control mechanisms may result in too many conflicts. This is often the case for Transactional Memory systems, which ultimately may need to restart all but one transaction executing at any given time. In such scenarios, the parallelization of a system typically reduces its performance, rather than increasing it. Yet, sometimes these operations could be executed faster (thus reducing the window of opportunity for conflicts) if their latent parallelism was used efficiently.
Our claim is that the parallelization of transactions can often increase the throughput and reduce the latency of a system's operations. The key idea is that, in a scenario of conflicting transactions, we may internally parallelize each of the contending transactions and run them one at a time, therefore reducing the time to complete each transaction and eliminating the conflicts between them. This approach works best when code parallelized at the top-level conflicts with a high probability, whereas the parallelization of each transaction results in mostly independent subtasks. In the general case, however, we cannot guarantee that the parallelization of each top-level transaction produces a set of conflict-free parallel subtasks. So, we need to execute these parallel subtasks within (nested) transactions, for which reason parallel nesting is crucial to achieve what we propose here.
Moreover, depending both on the amount of parallelism and on the probability of conflicts among subtasks of a top-level transaction, we may not be able to occupy all of the available processors with only subtasks of a single top-level transaction. But, in fact, running only one top-level transaction at a time is not needed: We may run concurrently other (parts of) top-level transactions, provided that they do not conflict with the remaining executing tasks, in order to obtain an increase in performance. Thus, it becomes clear that an efficient transaction scheduling policy becomes specially important in this scenario.
Unfortunately, few TM systems allow a transaction to be split in several concurrent subtasks, or make use of a transaction scheduling policy. To tackle this problem, we extended a multi-version, lock-free, Software TM to show that parallel nesting and transaction scheduling may be used to increase the performance in highly-contending, write-dominated workloads, by exploring the latent parallelism within top-level transactions.

Session 3: Language Integration and Tools

Title: Leveraging Transactional Memory to Extract Parallelism
Authors: Miguel A. Gonzalez-Mesa, Eladio Gutierrez and Oscar Plata
Abstract: Application performance is benefiting from the availability of multiple cores in modern processor chips. This benefit is subject to the efficient use of multithreading on a given architecture. In many situations, application performance depends strongly on the time spent in iterative computations, like loops. The efficiency obtained when parallelizing a loop lies in the correct resolution of dependencies which is frequently very conservative or even not possible at compile time.
In this context, the optimistic concurrency exploited by transactional memory systems may help to parallelize iterative computations. In this work, a low-overhead transactional memory support is proposed in order to extract parallelism when no data dependence information is available before runtime. Our proposal allows executing blocks of iterations in parallel whose data dependencies are preserved through a fully distributed transaction commit management and by means of an eager (conflict detection) - lazy (data versioning) model. In addition, some optimizations, like transactional data forwarding, were developed to improve the basic proposal.
At language level, our model extracts parallelism from an iterative computations by simply adding a directive at the beginning of the section. The iteration space is partitioned (in some way) by the compiler, defining each block as a transaction, which are distributed among the available threads. In addition, all memory reads and writes are translated into transactional reads and writes. These transactions are executed concurrently and out-of-order, exploiting maximum parallelism, but with specific commit control to fulfill data dependences. To accomplish correct parallel execution, transactions keep private OIDs (Order Identifiers), derived from the sequential ordering of the computation. When a conflict is eagerly detected, OIDs of involved transactions are used to determine the type of data dependence causing such conflict. False dependences are fulfilled by just enforcing commit ordering (according to OIDs). True dependences, in other hand, stall the writing transaction until the reading one commits. However, we allow the reading transaction to forward the conflicting data to the consumer transaction, reducing stall time at the cost of complicating commit management.
In our model we have also implemented a checkpoint mechanism that allows an executed but not yet committed transaction to release conflict/versioning data structures (read/write sets and write buffer) in order to allow starting a new transaction with a higher OID. This technique enables a thread to launch new transactions although other transactions with older OIDs are still uncommitted. Data forwarding permits correct execution progress of these new transactions. When one of these transactions commits successfully, the commit manager tries also to commit older pending transactions.
This approach has been implemented in a light-weight software transactional memory system, TinySTM, and is being evaluated on a set of benchmarks. Preliminary results on a synthetic benchmark computing the transpose of a sparse matrix show an important reduction in the number of transaction aborts regarding the baseline TinySTM. In addition, we measured the TCR (transaction commit rate), which is the percentage of committed transactions out of all executed transactions in a sample period. This parameter is a light-weight, application independent measure of exploited parallelism. In our approach, we observed the following results (taken about the middle of the execution time in a real machine):
TCR ( 4 threads, TinySTM): ~30% ; ( 4 threads, our approach): ~85%
TCR ( 8 threads, TinySTM): ~15% ; ( 8 threads, our approach): ~60%
TCR (16 threads, TinySTM): ~10% ; (16 threads, our approach): ~35%

Title: Applying Dataflow and Transactions to Lee Routing
Authors: Chris Seaton, Daniel Goodman, Mikel Luján, and Ian Watson
Abstract: Functional and deterministic models for creating parallelism such as dataflow are elegant and easy to reason about. However they can struggle to efficiently implement programs where there is a component of conceptually shared state or 'don't-care non-determinism'. Such programs can often be implemented more efficiently with shared memory, which can be achieved safely with atomic access as implemented by transactional memory.
We consider whether a combination of dataflow and transactional memory would allow us to achieve the benefits of both approaches. Dataflow provides for creation of parallelism and control flow that is functional and deterministic where possible. Where access to shared state is needed for efficiency, transactional memory can be used to make this access atomic and so compatible with the dataflow model of implicit parallelism and scheduling, where the programmer is not expected to know or care when two or more nodes are run in parallel.
To explore this idea we have extended the Scala language with constructs for dataflow programming and support for transactional memory. The Scala language gives us a hybrid foundation that already accepts both functional programming with immutable data structures, but also imperative programming with shared state. Our extensions allow for creating and implicitly scheduling a dataflow graph. Dataflow nodes can contain atomic blocks of instructions that provide composable atomic, consistent and weakly isolated access to shared memory without special typing or modification to existing data structures or libraries.
To evaluate the benefits of this approach we have implemented Lee's algorithm for circuit routing. This algorithm has properties that make it particularly suited to demonstrating the strengths of our combination of dataflow and transactions. It is highly irregular, with dependencies between potentially parallel parts of the program being unclear until they are actually executed. Its updates to shared state are often non-local and wide reaching, making a traditional division of data structures ineffective in exposing parallelism. It also allows for don't-care non-determinism and has multiple valid solutions when run in parallel.
We show how our approach applied to Lee's algorithm improves programability in terms of the required volume of code and the number of operations related to parallelism that only distract from the pure expression of the algorithm. Our implementation also achieves a real performance increase over a traditional implementation using coarse locks, and an implementation using only transactional memory.

Title: Objects with adaptive accessors to avoid STM barriers
Authors: Fernando Miguel Carvalho and João Cachopo
Abstract: In previous work, we proposed a new multi-versioning STM---adaptive object metadata (henceforth AOM for short)---that reduces substantially both the memory and the performance overheads associated with transactional locations that are not under contention. A multi-versioning STM has the benefit that read-only transactions never conflict with other transactions, but the drawback of typically requiring both more memory and extra indirections to access the value of each transactional location. The AOM solution has the benefits of the multi-versioning approach, but simultaneously reduces its overheads by swinging objects back and forth between two different layouts: a compact layout, where no memory overheads exist, and an extended layout, used when the object may be under contention.
AOM is an object-based design that follows the JVSTM general design, but it is adaptive because the metadata used for each transactional object changes over time, depending on how objects are accessed. We provided preliminary results that show an improvement in the performance and a reduction in the memory overheads for workloads where the number of objects written is much lower than the total number of transactional objects, such as the WormBench and the Lee-TM. Although these results confirm our expectations, this first implementation of the AOM still suffers from two limitations: (1) the results evidence some lack of scalability due to the use of a single global lock to protect the write-back phase; and (2) the need for runtime verifications such as checking the objects' headers on every access to transactional objects.
To address these two problems, we implemented a new version of the AOM that is based on the lock-free version of the JVSTM and we eliminated all the overheads of accessing objects in the compact layout during read-only transactions. To make the contention-free execution path free of any STM barrier, we duplicated the accessors of the transactional classes, so that one accesses directly the object fields and another uses STM barriers. Transactional objects may switch between these two accessors implementations according to the same principles that rule the layout transitions between the compact and extended layouts. To switch the behavior of accessors in the most efficient way, we use the new bytecode invoke dynamic available in Java 7, which supports efficient and flexible execution of method invocations in the absence of static type information.
These two enhancements contribute to achieve better scalability and improved performance for the WormBench and Lee-TM workloads. We have also tested the new release of the AOM in the Vacation benchmark and the results reinforce our previous conclusions.

Title: Supporting In-Place Metadata in DeuceSTM
Authors: Ricardo J. Dias and João M. Lourenço
Abstract: Several Software Transactional Memory (STM) algorithms have been developed in the last few years. These algorithms differ in many different properties, such as progress and safety guarantees, contention management, etc. Some STM frameworks (DSTM2 [1], DeuceSTM [2]) were developed to allow the implementation and comparison of many different algorithms using the same transactional interface.
The information required by an STM algorithm is either associated with the execution of a transaction (frequently referred as transaction descriptor), or associated with each memory location (or object reference) accessed within the transaction. The transaction descriptor is typically stored in a thread-local memory space and maintains the information necessary to validate and commit a memory transaction. The information associated with each memory location depends on the nature of the STM algorithm, which we will henceforth refer to as metadata, and may be composed by e.g. locks, timestamps or version lists. Since metadata is associated with each memory location, there are two main strategies regarding the location for storing such metadata: The metadata is stored either near each memory location (in-place strategy); or in a mapping table that associates the metadata with the corresponding memory location (out- place strategy).
DeuceSTM is one of the most efficient STM frameworks available for the Java programming language and provides a well defined interface that allows to implement several STM algorithms using an out-place strategy.
This work describes the adaptation and extension of DeuceSTM to support the in-place metadata strategy. Our approach complies to the following properties:
Efficiency --- The STM implementation does not rely on an auxiliary mapping table, thus providing faster direct access to the transactional metadata; Non-transactional code is oblivious to the presence of metadata in objects, hence there is no performance overhead whatsoever for non-transactional code; Transactional code avoids the extra memory dereference imposed by the decorator pattern; Primitive types are fully supported, even in transactional code; And we provide a solution for fully supporting transactional N-dimensional arrays, which again impose no overhead on non-transactional code.
Transparency --- The STM implementation automatically identifies, creates and initializes the necessary additional metadata fields in the objects; Non-transactional code is oblivious to the presence of metadata in objects, hence no code changes are required; The new transactional array types (supporting metadata for individual cells) are compatible with the standard arrays, hence not requiring any pre-/post- processing of the arrays when invoking standard or third-party non-transactional libraries.
Performance benchmarking has confirmed that our approach for supporting in-place metadata in DeuceSTM, alongside with the existing out-place strategy, is a valid functional complement, widening the range of algorithms that can be efficiently implemented in DeuceSTM, enabling their fair comparison in a common infrastructure.
[1] Herlihy, M., Luchangco, V., Moir, M.: A flexible framework for implementing software transactional memory. In: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications. pp. 253-262. OOPSLA ‚2006, ACM, New York, NY, USA (2006)
[2] Korland, G., Shavit, N., Felber, P.: Noninvasive concurrency with Java STM. In: MultiProg 2010: Programmability Issues for Heterogeneous Multicores (2010), http://www.deucestm.org/

Title: Highly parallel programming with Transactional Memory
Authors: Ricardo Filipe and João Barreto
Abstract: Each year, affordable multicore machines are including more and more cores. This trend calls for concurrent programs that expose enough parallelism to maximize the utilization of such increasing computing resources. Yet, concurrent programs are significantly more difficult to code than sequential ones, especially for inexpert programmers.
Although the emerging paradigm of Transactional Memory (TM) promises to simplify concurrent programming, it still exhibits key limitations when the goal is to expose high levels of parallelism. TM encapsulates all the hidden issues of fine-grained concurrency control in a simple abstraction, the transaction. With TM, the role of the programmer is reduced to forking/joining threads and creating transactions for synchronizing critical sections. However, not only is thread forking a non-transparent burden on the programmer, but it also requires a strong comprehension of the underlying application semantics. Therefore, the programmer will often create programs that exhibit lower levels of thread-level parallelism than what would be theoretically possible.
State of the art Software TMs (STM) have very limited support for parallelizing the execution of nested transactions. This hinders the expert programmer who is able to fork threads inside previously forked threads. This lack of support is all the more worrying if we believe that, eventually, most libraries will have functions that create transactions inside them to synchronize critical sections. A novice programmer may use these libraries in his code and end up with nested transactions for several levels of depth, without his knowledge.
Most STMs only run nested transactions serially and do not support parallel nested transactions at all. Some support only one level of parallel nesting, while others support any level of parallel nesting but have linearly increasing overhead. Other STMs have constant time overhead but treat every transactional operation as a write operation, thus hindering read dominated workloads. Most of these approaches require visible readers to work properly, which has been shown to be an undesirable property for an STM.
Our contribution will be developing an STM that supports any level of parallel nested transactions, with constant time overhead for read-dominated and read-write workloads, while retaining the properties proven most interesting of the underlying STM, such as invisible readers and on commit updates.
We intend to tackle some complex TM benchmarks, since those are the ones which would profit the most from further parallelism. Most of the existing TM complex benchmarks represent applications where internal loops or recursive functions are common and can be further parallelized.

Session 4: Applications and Performance Evaluation

Title: Dynamic Thread Mapping Based on Machine Learning for Transactional Memory Applications
Authors: Márcio Castro, Luis Fabricio Góes, Luiz Gustavo Fernandes, Jean-François Méhaut
Abstract: Thread mapping is an appealing approach to efficiently exploit the potential of modern chip-multiprocessors by making better use of cores and memory hierarchy. It allows multithreaded applications to amortize memory latency and/or reduce memory contention. However, efficient thread mapping relies upon matching the behavior of an application with system characteristics.
Software Transactional Memory (STM) appears as a promising concurrency control mechanism for those modern chip-multiprocessors. It allows programmers to write parallel code as transactions, which are guaranteed to execute atomically and in isolation regardless of eventual data races. However, applications can behave differently depending on the characteristics of the underlying STM system. Thus, the prediction of a suitable thread mapping strategy for a specific application/STM system becomes a daunting task.
Our previous work focused on a machine learning-based approach to statically infer a suitable thread mapping strategy for transactional memory applications. This means that the predicted thread mapping strategy is applied once at the beginning and does not change during the execution of the application. We demonstrated that this approach improved the performance of all STAMP applications, since most of the transactions within each application usually have very similar behavior.
We have constantly seen efforts for a wider adoption of Transactional Memory (TM). For instance, the next version of the GNU Compiler Collection (GCC 4.7) will support TM primitives and new BlueGene/Q processors will have hardware support for TM. Thus, it is expected that more complex applications will make use of TM in a near future. In those cases, static thread mapping will no longer improve the performance of those applications, emerging the necessity of dynamic or adaptive approaches.
In this paper, we propose a dynamic approach to do efficient thread mapping on STM applications composed of more diverse workloads. These workloads may go through different execution phases, each phase with potentially different transactional characteristics. At runtime, we gather useful information from the application, STM system and platform at specific intervals. At the end of each interval, we rely on a decision tree previously generated by a Machine Learning (ML) algorithm to decide if the current thread mapping strategy should be switched to a more adequate one. This dynamic approach was implemented within TinySTM as a module, so the core of TinySTM remains unchanged and it is transparent to the user. Our results show that the dynamic approach is up to 31% better than the best static thread mapping for those applications.

Title: On the Natural Degree of Parallelism in Transactional Memory: from Centralized to Distributed Architectures
Authors: Diego Didona, Pascal Felber, Derin Harmanci, Paolo Romano, Jörg Schenker
Abstract: Transactional memory (TM) has been widely studied over the last decade as it provides a scalable and easy- to-use alternative to locks. One of the key results highlighted by existing research is that, independently of the nature of the synchronization scheme adopted by a TM platform, its actual performance is strongly workload dependent on and affected by a number of complex, often intertwined factors (e.g. duration of transactions, level of data contention, ratio of update vs read-only transactions).
This talk will focus on the issue of automatically identifying the "natural" degree of parallelism of an application, i.e., the workload-specific threshold below which increasing concurrency will improve transactions throughput and over which additional concurrency will not help and might even degrade performance because of higher contention and abort rates, even if sufficient physical resources are available.
We will focus on the importance of adapting the concurrency level to the workload in various application settings and we will provide empirical evidence of the impact that it has on the delivered performance, presenting results taken from two very different scenarios: a shared-memory system with a low-level STM library written in C, and a distributed system with a high-level DSTM infrastructure written in Java.
We will overview two alternative self-tuning methodologies, based either on on-line exploration methods or on model-driven performance forecasting techniques, and analyze the pros and cons of the two approaches when employed in centralized vs distributed TM settings.
Finally, we will present ongoing work aimed at combining exploration-based and model-driven methods in order to take the best of the two worlds, namely exploiting on-line exploration to improve models accuracy, while using models to improve the scalability of on-line techniques. We present experimental results that back our claims in both centralized and distributed settings, and open the way to further research in adaptive mechanisms for elastic scaling of TM within and across nodes.

Title: Performance Analysis of and Tool Support for Transactional Memory on BG/Q
Authors: Martin Schindewolf, Martin Schulz, Barna Bihari, John Gyllenhaal, Amy Wang, and Wolfgang Karl
Abstract:  We studied the performance of the TM subsystem of BG/Q as well as researched the possibilities for tool support for TM.
To study the performance, we run CLOMP-TM. CLOMP-TM is a benchmark designed for the purpose to quantify the overhead of OpenMP and compare different synchronization primitives. To advance CLOMP-TM, we added Message Passing Interface (MPI) routines for a hybrid parallelization. This enables to run multiple MPI tasks, each running OpenMP, on one node. With these enhancements, a bene?cial MPI task to OpenMP thread ratio is determined.  Further, the synchronization primitives are ranked as a function of the application characteristics.   To demonstrate the usefulness of these results,  we investigate a real Monte Carlo simulation called Monte Carlo Benchmark (MCB). Applying the lessons learned yields the best task to thread ratio. Further, we were able to tune the synchronization by transactifying the MCB.
Further, we develop tools that capture the performance of the TM run time system and present it to the applicationís developer. The performance of the TM run time system relies on the built-in statistics. These tools use the Blue Gene Performance Monitoring (BGPM) interface to correlate the statistics from the TM run time system with performance counter values. This combination provides detailed insights in the run time behavior of the application and enables to track down the cause of degraded performance. Further,  one  tool has  been implemented that  separates  the  performance counters in three categories:  Successful Speculation, Unsuccessful Speculation and No Speculation.
All of the tools are crafted around IBMís xlc compiler for C and C++ and have been run and tested on a Q32 early access system.

Title: Improving Application Fault-Tolerance with Diverse Component Replication
Authors: João Soares and Nuno Preguiça
Abstract: Software systems/applications are often released with bugs, which commonly results in faulty behavior. Replication has often been used for dealing with fail-stop faults, while diversity allows systems to deal with additional faulty behaviour. Our goal is to offer developers fault-tolerant components with minimal overhead, thus improving the overall fault-tolerance of developed applications.
Our on-going work provides run-time fault detection and prevention, on single machine multi-core systems.
We define a new abstraction, called Macro-Component, designed to detect/prevent faults, by encapsulating diverse implementations of the same specification (i.e., diverse replicas with similar interfaces). Faults are detected by executing operations called on a Macro-Component (MC-operations) in all replicas, and comparing the set of obtained results (replicas that return results that contradict a majority are considered faulty).
To this end it is essential to preserve the causal order of MC-operations in all replicas, and to guarantee that operations are executed in the same order in all replicas. While a sequential approach, where an MC-operation returns only after being executed in every replica, guarantees this, it restricts performance, especially for multi-threaded applications. To address this, our design allows MC-operations to execute concurrently in all replicas while guaranteeing the preservation of the serialization order of the operations in all replicas.
To this end, we map MC-operations into groups of transactions (one transaction for each replica), and allow them to execute concurrently, leaving the STM engine (Deuce using TL2) to guaranty their serializability. To preserve replica consistency, we built our solution on a modified STM engine design with additional states - pre-validation, post-validation, post-commit, pre-abort and post-abort. This allows transactions (in the same or concurrent groups) to run in parallel on all replicas without forcing an a priori execution order. This prevents faster transactions from being held by slower ones.
Transaction groups are total ordered during commit phase. After validation, a transaction will verify if its group has already been assigned with a commit order. If this is true, then the transaction must preserve this order (aborting if necessary). If this is false (meaning this is the first transaction, of its group, to commit), an order is assigned to the group and the transaction commits. This way, all transactions in a group will commit in the same order in all replicas, thus preserving consistency.

Title: Differentiated Access to Virtual Resources in Cloud Environments
Authors: M. Fazio and A. Puliafito
Abstract: Resources provisioning is a very important issue in Cloud computing, i.e. how resources may be virtualized, organized, allocated and managed in order to deliver high quality services. In a realistic scenario, a Cloud is composed by several resource Providers, which provide the virtual resources, many Clients, which are the users or applications requesting resources, and some resource Brokers, which are  able to handle client requests and permits only authorized users to access resources in the virtual infrastructures. The strategic role of Brokers in the Cloud is proved by the emerging business products developed for cloud brokering (Kaavo, RightScale, Cloudkick...). However, it implies several challenging issues in concurrent attempts of accessing the resources, both within the same Broker, in order to increase the efficiency in the management of many Client requests at a time, and among several Brokers, in order to support the resource sharing. To this aim, we propose the adoption of Software Transactional Memory (STM) to simplify the concurrent management of shared-resources. We believe that STM represents a very helpful technology to provide a lightweight mechanism for synchronizing the activities of many Brokers, which have access to virtual resources of several providers, alleviating many of the problems associated with the locking of resources.
The application of STM in Cloud environment is not trivial and several challenging issues need to be investigated. First, an appropriate representation of virtual resources is necessary for applying classic STM techniques. Traditional Contention Managers do not well suit the management of resources in the Cloud, since their behavior should be strictly related with business policies and SLA agreements among Clients, Brokers and Providers. To address realistic and complex scenarios, STM solutions could be coupled with other access control systems.
In our presentation we propose a new architecture able to support the provisioning of heterogeneous resources and services in a complex business model. It performs the intermediation services atop existing cloud platforms to guarantee resource provisioning for applications/services. Also, it allows to implicitly implement the aggregation task (deploying customer services over multiple cloud platforms), since the resources given by Providers are aggregated in a single framework. Of course, clever allocation policies can take care of both resource availability, business agreements and specific features of Providers, so to provide also the arbitrage task (opportunistic choices and foster competition between Providers).
We investigate the proposed architecture within the Cloud@Home project, which proposes a new volunteer computing paradigm, where the infrastructure is obtained by merging heterogeneous resources offered by different domains and/or providers. The architecture is implemented making use of Deuce, a flexible runtime environment for Java STM. Also, it manages Client requests through XACML policies, in order to offer a differentiated resource provisioning service, where transactions on virtual resources depends on specific SLA.