Title: Workload-Oblivious Self-Tuning of Intel TSX
Authors: Nuno Diegues and Paolo Romano
Abstract: The multi-core revolution that took place nearly one decade ago has turned parallel programming into a major concern for mainstream software developers. In this context, Transactional Memory has emerged as a simple and attractive paradigm to deal with synchronization in parallel programs, simply by requiring the developers to annotate atomic blocks (that shall run in transactions) in the program. Recently, this powerful abstraction has been made available in the instruction set of x86 processors in the Intel Haswell family, under the name of Intel TSX. 
In this paper we show that, although TSX is easy to use, its performance can vary significantly due to small configurations in its invocation via software. More than that, there does not seem to exist a simple static configuration that can perform good independently of the application and workload. As such, we present a set of lightweight techniques that allow to self-tune TSX in a workload-oblivious manner. This means that we do not require a priori knowledge of the application, and execute only runtime adaptation given sampling of the application behaviour. For this, we implemented our algorithm in GCC to compile transactional programs that use atomic keyword in blocks in the source code. We evaluated our solution in a comprehensive set of benchmarks and applications, and obtained consistent improvements over the available alternatives. Our changes are completely transparent from the programmer, as they require no effort from him, and reside completely hidden under GCC.

Title: Towards a more scalable Deferred Update Replication
Authors: Maciej Kokociński, Tadeusz Kobus and Paweł T. Wojciechowski
Abstract: In our work we research various replication techniques for clusters of a moderate size (10-15 nodes). Deferred Update Replication (DUR) based on Total Order Broadcast [1] is a well known optimistic concurrency control scheme for such systems. In DUR requests (transactions) are executed in isolation on copies of shared data objects and before committing undergo a certification procedure to ensure consistency. There are several reasons why DUR is particularly an interesting solution. Firstly, for each transaction’s run, in DUR processes synchronize only once, when the system tries to commit the transaction: after a transaction’s execution, the resulting updates together with metadata required for certification are broadcast to all nodes. Secondly, in DUR there is no central synchronization point. Even transaction certification is performed by each process independently. Limited synchronization between the processes as well as inherent tolerance of partial system failures make DUR work well in a distributed environment.
However, DUR is not without weaknesses. The very features that make DUR handle well partial system failures also limit its scalability. The more processes take part in the computation, the higher the network usage and the overhead caused by independent certification. Some of the existing approaches try to alleviate those weaknesses by, e.g., using only one process for transaction certification [2] or establishing leases for shared objects [3]. These approaches, however, trade the amount of network traffic and certification overhead for increased latency or weaker consistency guarantees. 
Our recent research focuses on DUR enhancements that alleviate the aforementioned limitations. In fact, our work concerns a class of algorithms that are made possible by allowing for closer integration between the broadcast protocol and the algorithm itself. We would like to share the various problems we encountered while devising our novel approach. Especially, we would like to focus on communication primitives such as Total Order Broadcast which, in some situations, provide insufficient guarantees. Finally, we would like to show some preliminary results of the enhanced DUR algorithm. 
[1] Fernando Pedone, Rachid Guerraoui and Andre Schiper, “Exploiting Atomic Broadcast in Replicated Databases”, Proc. of Euro-Par '98, 1998.
[2] Bettina Kemme and Gustavo Alonso, “Don't Be Lazy, Be Consistent: Postgres-R, a New Way to Implement Database Replication”, Proc. VLDB '00, 2000.
[3] Nuno Carvalho, Paolo Romano and Luís Rodrigues, “Asynchronous Lease-Based Replication of Software Transactional Memory”, Proc. of Middleware '10, 2010. 

Title: STMGC-C7: Fast Software Transactional Memory for Dynamic Languages
Authors: Armin Rigo and Remi Meier
Abstract: As part of the PyPy project, we explore the usage of transactional memory (TM) to enable parallelism for high-level, dynamic languages like Python or Ruby. 
Most current software TM (STM) systems suffer from a big overhead when they run on a single thread only (usually between 2-5x slowdown). They try to scale to a large number of CPUs for the benefit of parallelization to be greater than the penalty of the overhead. On the other hand, while also software-based, the system presented here initially focuses on a low CPU count (< 8). It uses an approach that can keep the single-thread overhead very low (initial experiments with a simple lisp interpreter suggest around 15%). As a consequence we already see great speed-ups over single-threaded, non-STM execution by only using 2 CPU cores. We achieve this with very-low-overhead read barriers and very-low-overhead fast paths of write barriers. The enabling mechanism, the Linux-only system call “remap_file_pages”, allows for creating several “views” on partially shared memory; every thread sees one of these views. 
Our goal is to support a mixture of short to very long transactions. We develop an object-based STM system with an integrated GC handling the typical high allocation rates of dynamic languages; in particular, it is a generational GC, and the write barrier combines GC and STM roles, always taking the fast path for recently-allocated objects. 
The next step is to finish integrating this system with PyPy, the Python interpreter in Python, and its Just-In-Time compiler. This is relatively straightforward and partly done already. We believe that the result has the potential to give good enough performance to rival or exceed the HTM experiments which have been done before on Ruby [1]. Future considerations also include optionally adding a hybrid (HyTM) component to our system. 
[1] Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory, Rei Odaira, Jose G. Castanos and Hisanobu Tomari, PPoPP '14 Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 131-142.

Title: Dealing with Saturation in Signature-based Transactional Memory Systems
Authors: Ricardo Quislant, Eladio Gutiérrez, Sonia Gonzalez-Navarro and Oscar Plata
Abstract: Transactional memory (TM) is devised to ease the programming of shared memory parallel applications with the premise that transactions execute optimistically in parallel while the underlying system is in charge of managing potential memory conflicts. In TM systems, signatures have been proposed to keep track of the memory locations read and written by a transaction in a concise and unbounded way to help conflict detection. However, signatures are imperfect hash structures that can give rise to false conflicts due to memory location aliasing. Such conflicts are more likely as signature fills, to such an extent that false conflicts can lead a parallel application to perform worse than its sequential version. 
In the context of hardware TM, signatures are usually implemented as per-thread Bloom filters. Bloom filters are time and space-efficient hashing structures that allow an unlimited number of element insertions without the inconvenient of getting false positives on membership queries. In this work we propose a solution to avoid the negative performance effects of a high rate of false conflicts due to signature saturation. The aims of this study are twofold: (a) how to determine the level of signature saturation beyond which performance is degraded excessively; (b) how to act in order to avoid such performance degradation when the saturation threshold is reached. 
In this work, a serialization scheme is proposed that synchronizes the transactions running in parallel in the system whenever one of them has gone beyond a threshold of signature filling. Then, such a transaction keeps running while the others are stalled, thus reducing the probability of false conflicts. Results show that the performance of the parallel application is kept around the performance of the sequential version when conflicts due to signature saturation would lead the base TM system to perform much worse. From the experimental evaluation we find that a fixed threshold of signature filling works sufficiently well for a variety of transactional workloads.

Title: Improving STMs' performance via speculative parallelization
Authors: Hugo Rito and João Cachopo
Abstract: Software Transactional Memory (STM) is one promising abstraction to simplify the task of writing highly parallel applications that take full advantage of modern multicore machines. Yet, STMs' optimistic approach to concurrency control makes STMs impractical in high-contention scenarios where frequent transaction reexecutions hinder the progress, and, therefore, the performance of STM-based systems. 
In this work, we describe how STMs may take advantage of latent intra-transaction parallelism to improve STMs performance in conflict-intensive workloads using speculative parallelization. We developed an adaptive STM runtime that monitors contention in the system to detect periods of low inter-transaction parallelism when the STM should prioritize intra-transaction parallelism. Thus, as transactions start conflicting, our STM runtime automatically reduces the number of threads executing different transactions and uses the computational resources freed in the process to explore parallelism within the remaining in-flight transactions. 
Furthermore, we discuss how the information collected during a transaction's aborted execution may help the STM runtime choose the best tasks to execute speculatively in parallel. The idea is to use the transaction's previous read-set and write-set to identify tasks that usually do not conflict and, thus, are good candidates to run in parallel. In this context, two tasks of the same transaction do not conflict if the task that executes first in program order does not write something that the task that runs later in program order reads. Finally, we discuss the benefits of combining speculative parallelization with transaction scheduling and we show how per-transaction information may be used to execute conflicting tasks speculatively.

Title: Optimized Single Global Lock Fallback
Authors: Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam and Maurice Herlihy
Abstract: Intel’s Haswell and IBM’s Blue Gene/Q and System Z are the first commercially available systems to include hardware transactional memory (HTM). However, practical HTMs are best-effort: they do not guarantee forward progress. Furthermore, most HTMs are bounded in size and support a restricted set of operations. Therefore, ensuring forward progress requires a software fallback. 
Hybrid transactional memory (HyTM) allows unsuccessful hardware transactions to revert to software. Existing HyTM proposals have drawbacks: memory accesses must be annotated, duplicate code is required for transactional and non-transactional paths, and it is not easily applied to legacy code. For many applications, however, it is simpler and more attractive to use a single global lock (SGL) mechanism, where all transactions that access a particular data structure synchronize through a single lock . Perhaps the most visible example of an SGL fallback scheme is Haswell’s hardware lock elision (HLE) [1], which supports a lock fallback directly through the instruction set architecture. SGL schemes are attractive because they can easily be retrofitted to legacy code, and they do not require code duplication. 
In both HLE, and HTM with SGL fallback, each hardware transaction starts by reading the lock’s state, called subscribing to the lock. Subscription ensure that any software transaction that subsequently acquires that lock will provoke a data conflict, ensuring correctness by forcing any active subscribing hardware transactions to abort. The duration of a lock subscription represents a “window of vulnerability” during which the arrival of a software transaction will prevent any subscribing hardware transactions from executing. 
In this work we present a novel optimization to the simple SGL fallback approach. One can significantly improve performance by performing lock subscription in a lazy manner: optimistically postponing reading the lock state for as long as possible (usually the very end of the transaction). Lazy subscription was first proposed in the context of Hybrid NOrec [2] to allow concurrent execution of multiple hardware transactions with the committing phase of a speculative software transaction. In our work, lazy subscription allows concurrent execution of multiple hardware transactions with a single non-speculative SGL transaction. The resulting mechanism maintains the simplicity and correctness of the original SGL fallback, but reduces its costs. Our results on the STAMP benchmark suite show that this approach reduces the rate of lock acquisitions compared to Haswell HLE and the original SGL fallback and improves performance on small and medium sized transactions that fit in the cache. 
[1] Intel Corporation. Hardware lock elision in Haswell. Retrieved from http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-win/GUID-A462FBC8-37F2-490F-A68B-2FFA8010DEBC.htm. 
[2] L. Dalessandro, F. Carouge, S. White, Y. Lev, M. Moir, M. L. Scott, and M. F. Spear. Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional memory. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS XVI, pages 39–52, New York, NY, USA, 2011. ACM.

Title: On Breaching the Wall of Impossibility Results on Disjoint-Access Parallel STM
Authors: Sebastiano Peluso, Roberto Palmieri, Paolo Romano, Binoy Ravindran and Francesco Quaglia
Abstract: Over the last decade, Software Transactional Memory (STM) has emerged as an attractive paradigm for simplifying parallel programming. A property that is deemed as central for the scalability of a STM implementation is its ability to avoid any interference between transactions that access disjoint data sets – a property called disjoint-access parallelism (DAP).
Initial works [2, 8, 3, 4] in the area of DAP and STM addressed a number of fundamental questions, including impossibility results, on the cost of building STM systems achieving DAP, while ensuring popular consistency criteria such as Opacity and Snapshot Isolation. This body of literature concludes that an STM cannot be DAP, ensure wait-free and invisible read-only transactions (WFIRO), and guarantee Strict Serializability [2]. Since these works established a number of impossibility results related to the implementation of DAP STMs [2, 5, 8], in this talk, we intend to present a possibility result: an algorithm that achieves a set of desirable properties, i.e. DAP and WFIRO transactions, by embracing consistency criteria different from those typically targeted by existing STM algorithms (such as SI or Opacity).
More in detail, we seek an answer to the following question: considering Adya’s consistency hier- archy [1], what is the strongest criterion that allows for achieving DAP and WFIRO? This criterion is EPL-3U, also known as Extended Update Serializability (EUS) [7]. EUS ensures that: i) every read operation always observes a consistent transactional state, and ii) write-committed transactions are serializable. In addition, due to the impossibility result in [5], our algorithm guarantees weakly progressive write transactions [6].
By doing so, we also take into account the real-time order property and we reveal an impossibility result: the real-time order cannot be ensured in an STM that is weakly DAP, obstruction-free, and guarantees wait-free read-only transactions, irrespective of the visibility of read operations and the consistency level. Leveraging this impossibility result, we show that the lower bound on the visibility of read-only transactions in [2] is not a sufficient condition to guarantee Strict Serializability and hence Opacity. We prove also that, ensuring the real-time order is still impossible in a weakly DAP and weakly progressive STM that ensures WFIRO. Therefore, in our algorithm, we enforce real-time order to only conflicting transactions.
Finally, we show that, in the proposed algorithm, the spatial cost of each object version is optimal, i.e. it is a necessary cost for any algorithm that ensures EUS, DAP and WFIRO.

Title: TM+TLS offloading to Helper Threads
Authors: Ricardo Filipe and Joao Barreto
Abstract: Transactional Memory with Thread Level Speculation (TM+TLS) has been proposed as a new programming paradigm that achieves higher levels of parallelism than ever before, by dividing TM transactions into TLS tasks that will execute in parallel, each on its own hardware thread. 
The original TM+TLS algorithm, TLSTM, is very naïve on operations that require synchronization of a transaction's tasks. TLSTM also lacks good usage of idle task time. These are the main problems of TLSTM: 
Transaction validation is executed separately by each task 
Transaction abort requires that all of its tasks synchronize to abort the transaction atomically 
If a task of some transaction completes its execution early, it starts busy waiting until all of the transaction's tasks complete 
Several previous work on TM and TLS have used helper threads to offload expensive computation, while allowing for further progress of the worker threads running the application's code. We believe that by using helper threads we can improve TLSTM's transaction validation, commit and abort, which in turn allows for worker threads running tasks to progress after completing some application work. The main idea is to have a helper thread per application thread which centralizes the framework's TM and TLS operations, while worker threads running tasks mostly perform the application's computation. 
We have already started implementing this idea, by offloading the transaction's validation procedure onto a helper thread. We have a noticeable performance improvement already in the benchmarks used on the original TLSTM paper. However, that is just the start. We also plan to move the transaction commit and abort procedures to the helper thread of each application thread. These changes present some challenges when allowing worker threads to advance further than the current transaction being worked on by the helper thread. 
Do we allow worker threads to advance a set number of tasks or indefinitely if no abort signal is issued? It depends on the impact that allowing worker threads to advance has on the application's performance. Thread starvation may occur since we are basically extending the lifetime of an application thread's transaction. Either way, the abort procedure will now have to take into account all of the data that was affected between the transaction that is being validated by the helper thread and the transaction that the worker threads are executing, which may be several transactions away.

Title: Transaction Preserialisation in Transactional Memory
Authors: Tiago M. Vale, Ricardo J. Dias and João M. Lourenço
Abstract: Transactional Memory (TM) simplifies concurrent multi-threaded programming by delegating synchronisation to the runtime infrastructure. The programmer identifies code blocks that should run atomically---transactions---and a TM algorithm transparently manages concurrent accesses to shared state to guarantee correctness. 
In the context of TM it is important for correctness criteria to guarantee equivalence to a sequential execution. Informally this means that in spite of executing multiple transactions concurrently, the result is "as if" the transactions executed serially according to some order. Opacity is an example of such correctness criterion. 
The runtime infrastructure executes two tasks to guarantee correctness: (1) it defines a total order of all transactions; and (2) controls the concurrent execution of transactions to respect the defined total order. The total order of transactions describes the sequential execution that the controlled execution guarantees equivalence to. 
Existing TM algorithms perform both tasks during transaction execution, i.e., they simultaneously construct the total order and control concurrent execution. Serialising transactions during their execution has two consequences: (1) the process of constructing the total order is nondeterministic because it depends on the interleaving of transaction instructions generated by their multi-threaded execution; and (2) transactions may experience starvation by systematically aborting due to concurrent conflicting transactions. 
Although correctness is unaffected by these consequences, nondeterministic serialisation complicates debugging, testing, reproducing errors, and multi-threaded state machine replication, while starvation is simply suboptimal behaviour. 
In this work we decouple the task of serialising transactions from the TM algorithm. In this new model transactions are serialised before execution by an external process, and the TM algorithm just controls the concurrent execution of transactions to guarantee that the serialisation order is respected. 
We present a TM algorithm that respects opacity but does not serialise transactions; the algorithm relies on the serialisation order as an input and controls transaction execution accordingly. The algorithm also takes advantage of the predefined serialisation order to improve efficiency. 
Externalising the transaction serialisation logic has many interesting applications such as deterministic multi-threaded execution, record-replay execution, and multi-threaded state machine replication of TM programs. Alternatively the logic can be as simple as serialising transactions at start time, instead of commit time as typical TM algorithms. Serialising at start time has the benefit that transactions are starvation-free.

Title: Toward wait-free transactional memory with irrevocables
Authors: Jan Kończak, Rachid Guerraoui and Paweł T. Wojciechowski
Abstract: Transactional memory (TM) is an emerging new technology aiming to replace cumbersome locking mechanisms in concurrent systems. TM has also become a popular research topic and many experimental TM implementations appeared for the last decade. 
The implementations typically run transactions optimistically and resolve any conflicts among them using a transaction rollback mechanism. 
However, some operations, called irrevocable operations, can be neither rolled back, nor even compensated. For instance, many system calls and I/O operations have irreversible results. Disallowing such operations in transactions would severely limit TM applications. 
Thus, if transactional memory aims at replacing locks in real programs, it must embody mechanisms to support irrevocable transactions – transactions that guarantee no forced abort or rollback. Currently, to this end, optimistic approach is merged with pessimistic concurrency control. However, that causes transactions to wait. 
In our work, we show that a TM system with support for irrevocability can be completely wait-free without sacrificing its robustness. This result is useful for the research community since it fills a gap in our understanding of what can be achieved in TM systems. 
We point out that the main challenge for creating a robust wait-free TM with irrevocability is dealing with conflicts detected by an irrevocable transaction. 
This is crucial, as if such conflict arises, the irrevocable transaction must deal with it, even despite the other transaction may have started the commit, yet did not finish it. Keeping in mind that the irrevocable transaction must neither wait nor abort self, the problem becomes complex. 
We introduce two wait-free TM algorithms that solve the problem. The first one relies on the compare-and-swap (CAS) operation, a strong synchronization primitive, often required by TM implementations nowadays. However, CAS is expensive. That is why we show a second algorithm, which, even with primitives as weak as try-locks, is wait-free and supports irrevocability. 
To retain good performance, both algorithms have small impact on the TM when only revocable transactions are live. In presence of irrevocable transactions our algorithms allows high concurrency by restraining neither read nor update transactions from making progress and committing in parallel. Finally, we discuss the pros and cons of the two algorithms.

Title: Automatic Tuning of the Parallelism Degree in Hardware Transactional Memory
Authors: Diego Rughetti, Paolo Romano, Francesco Quaglia and Bruno Ciciani
Abstract: Transactional Memory (TM) is an emerging paradigm that promises to ease significantly the development of parallel applications. Performance of TM is, however, a more controversial issue. Due to its inherently speculative nature, in fact, TM can suffer of performance degradations in presence of conflict intensive workloads and excessively high degrees of parallelism. In such a context, one core issue to cope with is related to untapping available parallelism, while avoiding thrashing phenomena due to excessive data contention and high transaction abort rates. 
For the case of Software-based implementations of TM (STM), several approaches have been proposed to cope with thrashing avoidance. One of the key techniques exploited in these approaches consists in (dynamically) regulating the actual level of concurrency (number of active threads) while running the application. The goal of these techniques is to adapt concurrency to the level that fully exploit the intrinsic parallelism in the data-access pattern, which is in turn expected to lead to the optimal speedup while running the application. 
All these approaches rely on performance models (either white-box, e.g. analytic or black-box, e.g. machine-learning), which are used to predict the expected performance, depending on the application’s workload, while varying the number of threads. On the other hand, we are not aware of any study in literature investigating the issue of how to optimize the degree of parallelism in HTM systems. 
This talk discusses the results of an Short Term Scientific Mission funded by Euro-TM which was precisely aimed to fill this gap, whose relevance is particularly strong given the recent integration of HTM in mainstream processors. We will show that the problem of self-tuning the degree of parallelism in HTMbased systems cannot be effectively addressed by reusing techniques originally conceived to operate in STM contexts, due to two key reasons: 
– existing techniques targeting STM rely on models that do not consider transaction abort causes that are specific to HTM, and that are completely absent in STM systems. In particular, in HTM, a large number of transaction aborts is typically due to capacity constraints of processors’ caches, as well as to a plethora of different micro-architectural reasons; 
– STM-oriented approaches are typically based on software instrumentation and run-time monitoring of specific parameters whose values serve as input to instantiate performance models aimed to guide concurrency optimization. Monitoring these same parameters in the context of HTM is however unaffordable, as existing HTM implementations do not externalize them, and monitoring them at the software level would induce overheads analogous to those of implementing an STM, defeating the whole purpose of HTM. 
In the light of these considerations we propose a novel machine learning based technique to dynamically adapt the concurrency degree of HTM-based applications. The proposed self-tuning mechanism is explicitly designed to take into account the peculiarities of HTM systems, and to avoid the issues that affect existing STM-oriented solutions. Via an extensive experimental evaluation based on the well known STAMP benchmark suite, and on a HTM-equipped Intel Haswell processor (8 virtual cores - 4 physical with hyper-threading), we show that the proposed approach achieves, on average, twice the accuracy of existing methods, while imposing negligible overheads and abating learning times dramatically. 

Title: Towards Validating Software Transactional Memory Implementations
Authors: Martin Nowack and Christof Fetzer
Abstract: Transactional Memory (TM) became mainstream; different software-transactional memory implementations were developed and are nowadays available as libraries. Many libraries implement a common transactional memory application binary interface (TM-ABI) to standardize the access to TM libraries and make them exchangeable. For testing these libraries, different methods were developed which exercise the TM-ABI. Typically, tests call the libraries using different patterns to see if they work correctly. And to show that implementations are suitable for concurrent execution, test suites utilize parallel running transactions and interleave their calls to the library deterministically. Nevertheless, the internals of a single STM-library call provide more opportunities for possible different interleavings with other calls, which cannot necessarily be described nor triggered by the more coarse-grain interleavings based on calls to the interface. Moreover, they are specific to each TM library and cannot be reused. This might hide latent bugs in the implementation which are triggered by different workloads or machine configurations. 
This work presents a system, which systematically executes C/C++ STM-library implementations and which checks different interleavings based on the granularity of low-level instructions.
For this, we extended an existing Symbolic Execution engine to support transactions and low-level synchronization primitives. Furthermore, we enhanced it with different heuristics to mitigate the general problem of state explosion inherited by Symbolic Execution.
We show preliminary results from our system including bugs we found in STM implementations. 

Title: Robust DTM performance prediction by exploiting model diversity
Authors: Ennio Torre, Diego Didona and Paolo Romano
Abstract: Cloud computing is becoming the mainstream computing paradigm thanks to its ability of dispensing resource elastically, on demand, according to a pay-for-what-you use pricing scheme. Distributed Transactional Memory [1–4] is an attractive programming model for the Cloud [5, 6]: it simplifies the task of developing distributed applications by sheltering programmers from the burden of dealing with idiosyncrasies endemic to the Cloud, namely concurrency, faults, network latencies and the elasticity itself of the underlying environment.
The definition of a performance model for an application is a fundamental step towards the exploitation of the Cloud pricing model, as it allows to implement automatic resource provisioning scheme and self- tuning strategies aimed at minimizing operational costs while matching a specific Quality of Service. Unfortunately, building such a model for a DTM application is an extremely challenging task. This complexity naturally stems from the multidimensionality of both the tunable parameters space of DTMs and of the features space of transactional workloads. In fact, parameters like the number of nodes composing the platform or the data replication policy and workload characteristics, affect applications performance in a definitely non-linear fashion. This is imputable to the simultaneous, and often inter-twined, effects that such parameters have on computational resources (CPU, memory, network) and on logical contention on data (i.e., conflicts among transactions) [7–9].
State of the art solutions that address this problem are mainly based on Machine Learning (ML) and Analytical Modeling (AM). Both approaches come with virtues and limitations.
ML does not require building and validating a white-box model of a (D)TM system [10–16]. On the other hand, the accuracy of ML-based approaches strongly depends on the completeness and representativeness of the dataset used to train the ML algorithm: lacking knowledge on the actual dynamics of the modeled system, predictions targeting an area of the feature space not sufficiently explored during the training phase have typically very poor accuracy.
AM-based techniques [8, 17–19], conversely, requires no or minimal training, and, in order to instantiate the model, it typically necessitates solely of collecting a set of metrics necessary that can be gathered online from a production system. Concerning accuracy, AM normally achieves a good
overall accuracy. However, in order to ensure their tractability, AMs typically rely on a set of simplifying assumptions. Their accuracy can hence be seriously challenged in scenarios (i.e. areas of the feature space) in which the assumptions they are built upon are not matched.
In this talk we are going to present work in progress aimed at building innovative performance meta-predictor that combines the two approaches in order to get the best of the two worlds. When queried for the expected application performance for a given DTM configuration and a given workload, the meta-predictor collects the output of both ML and AM-based forecasting tools, and returns the result coming from the predictor with the highest expected accuracy in the parameters space region the input falls into. The choice on what predictor to trust is driven by a K-nearest-neighbour based gating scheme [20], that is trained online gathering feedback on the actual accuracy delivered of the underlying forecasting tools.
We present initial experimental results based on a synthetic benchmark- ing framework for DTM platforms that highlight the potential of the proposed methodology to combine effectively a state of the art AM- based modeling technique and a pure ML-based approach (relying on decision tree regressors [21]) and achieve higher levels of accuracy than the ones attainable using the two approaches individually.
1. Couceiro, M., Romano, P., Carvalho, N., Rodrigues, L.: D2stm: Dependable distributed software transactional memory. In: Proc. of Pacific Rim International Symposium on Dependable Computing. PRDC (2009)
2. Herlihy, M., Sun, Y.: Distributed transactional memory for metric-space networks. In: Proceedings of the 19th International Conference on Distributed Computing. DISC’05, Berlin, Heidelberg, Springer-Verlag (2005) 324–338
3. Turcu, A., Ravindran, B., Palmieri, R.: Hyflow2: A high performance distributed transactional memory framework in scala. In: Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. PPPJ ’13, New York, NY, USA, ACM (2013) 79–88
4. Kobus, T., Kokocinski, M., Wojciechowski, P.T.: Hybrid replication: State- machine-based and deferred-update replication schemes combined. In: Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems. ICDCS ’13, Washington, DC, USA, IEEE Computer Society (2013) 286–296
5. Romano, P., Carvalho, N., Rodrigues, L.: Towards distributed software transactional memory systems. In: Proceedings of the 2Nd Workshop on Large-Scale Distributed Systems and Middleware. LADIS ’08, New York, NY, USA, ACM (2008) 4:1–4:4
6. Romano, P., Rodrigues, L., Carvalho, N., Cachopo, J.: Cloud-tm: harnessing the cloud with distributed transactional memories. SIGOPS Oper. Syst. Rev. (2010)
7. Felber, P., Fetzer, C., Riegel, T.: Dynamic performance tuning of word-based software transactional memory. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’08, New York, NY, USA, ACM (2008) 237–246
8. Didona, D., Romano, P., Peluso, S., Quaglia, F.: Transactional auto scaler: Elastic scaling of in-memory transactional data grids. In: Proceedings of the 9th Inter- national Conference on Autonomic Computing. ICAC ’12, New York, NY, USA, ACM (2012) 125–134
9. Didona, D., Felber, P., Harmanci, D., Romano, P., Schenker, J.: Identifying the optimal level of parallelism in transactional memory applications. Computing (2013) 1–21
10. Couceiro, M., Romano, P., Rodrigues, L.: Polycert: polymorphic self-optimizing replication for in-memory transactional grids. In: Proc. of the International Middleware Conference. Middleware (2011)
11. Couceiro, M., Ruivo, P., Romano, P., Rodrigues, L.: Score: a scalable one-copy serializable partial replication protocol. In: Proc. of the International Conference on Dependable Systems and Networks. DSN (2013)
12. Couceiro, M., Romano, P., Rodrigues, L.: A machine learning approach to performance prediction of total order broadcast protocols. In: SASO’10: Proceedings of the 4th IEEE International Conference on Self-Adaptive and Self-Organizing Systems, Budapest, Hungary (September 2010)
13. Castro, M., Goes, L.F.W., Ribeiro, C.P., Cole, M., Cintra, M., Mehaut, J.F.: A machine learning-based approach for thread mapping on transactional memory applications. In: Proceedings of the 2011 18th International Conference on High Performance Computing. HIPC ’11, Washington, DC, USA, IEEE Computer Society (2011) 1–10
14. Castro, M., Ges, L., Fernandes, L., Mhaut, J.F.: Dynamic thread mapping based on machine learning for transactional memory applications. In Kaklamanis, C., Papatheodorou, T., Spirakis, P., eds.: Euro-Par 2012 Parallel Processing. Volume 7484 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2012) 465–476
15. Rughetti, D., Di Sanzo, P., Ciciani, B., Quaglia, F.: Machine learning-based self- adjusting concurrency in software transactional memory systems. In: Proc. of the International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. MASCOTS (2012)
16. Rughetti, D., Di Sanzo, P., Ciciani, B., Quaglia, F.: Auto-tuning of cloud-based in-memory transactional data grids via machine learning. In: Proc. of the International Symposium on Network Cloud Computing and Applications. NCCA (2012)
17. di Sanzo, P., Ciciani, B., Palmieri, R., Quaglia, F., Romano, P.: On the analytical modeling of concurrency control algorithms for software transactional memories: The case of commit-time-locking. Performance Evaluation (2012)
18. di Sanzo, P., Ciciani, B., Quaglia, F., Romano, P.: A performance model of multi- version concurrency control. In: Proc. of the International Symposium on Model- ing, Analysis and Simulation of Computer and Telecommunication Systems (MAS- COTS). (2008)
19. Heindl, A., Pokam, G.: An analytic framework for performance modeling of soft- ware transactional memory. Comput. Netw. (2009)
20. Russell, S.J., Norvig, P., Candy, J.F., Malik, J.M., Edwards, D.D.: Artificial intel- ligence: a modern approach. (1996)
21. Quinlan, J.R.: Rulequest Cubist. http://www.rulequest.com/cubist-info.html (2012) 

Title: Providing Transaction Class-Based QoS in in-Memory Data Grids Via Machine Learning
Authors: Pierangelo Di Sanzo, Francesco Molfese, Diego Rughetti and Bruno Ciciani
Abstract: In-memory transactional data grids have demonstrated to be particularly suited to take advantages of elastic cloud architectures, mainly thanks to their ability to be dynamically (re-)sized and tuned. Anyway, when specific QoS requirements have to be met, this kind of data platforms are complex to be managed by humans. Particularly, achieving optimal performance has revealed to be hard without the stand of mechanisms supporting run-time automatic sizing/tuning of the data platform and the underlying (virtual) hardware resources. In this research work, we developed a neural network-based architecture where an in-memory transactional data grid is constantly and automatically re-configured, specifically in terms of exploited computing resources, local concurrency level and global data replication factor. The goal of the re-configuration process supported by our architecture is to provide transaction class-based QoS while minimizing costs of the infrastructure. We also discuss some results showing the effectiveness of our architecture, which has been evaluated on top of Future Grid IaaS Cloud using Red Hat Infinispan in-memory data grid and the TPC-C benchmark.

Title: Towards Relaxing Software Transactional Memory
Authors: Christoph Kirsch and Michael Lippautz
Abstract: A key challenge when dealing with concurrent programs is maintaining correctness while still providing a fast implementation that potentially scales with the number of parallel processing units available. In the field of concurrent data structures correctness is typically defined by linearizablity [1] with respect to some kind of specification. However, providing a concrete data structure implementation that is linearizable with respect to a strict specification [2] often requires synchronization on a small number of variables (points of contention), limiting performance and scalability. A way out of this dilemma is to allow relaxed behavior that is still acceptable for specific applications. The relaxed behavior is introduced by providing additional levels of freedom in the specification [3], which are then used by concrete implementations to distribute synchronization among multiple points of contention. Examples include k-out-of-order relaxations on stacks [3] and queues [4] where elements may be inserted and removed out of order up to some bound k. Experiments show that relaxed data structures, e.g. FIFO-like queues, outperform and outscale existing state-of-the-art implementations. 
Software transactional memory (STM), also providing strong correctness conditions for programmers, suffers from similar performance degradation in workloads where multiple workers read from and write to same memory locations. Again, this is a result of synchronizing reads and writes (restricting them) in one way or another. The idea in relaxed STM is to not restrict concurrent transactions, but to let concurrent transactions create multiple copies of a location in memory, essentially diverging from one another. Applications settling at some steady state can tolerate this approximative behavior as diverged processing paths will reach consensus eventually. The resulting design may still follow the principles of STM (level of abstraction; ease-of-use), while being fast and scalable at the expense of some relaxed behavior. 
[1] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, 2008. 
[2] H. Attiya, R. Guerraoui, D. Hendler, P. Kuznetsov, M. Michael, and M. Vechev. Laws of order: Expensive synchronization in concurrent algorithms cannot be eliminated. In Proc. POPL. ACM, 2011. 
[3] T.A. Henzinger, C.M. Kirsch, H. Payer, A. Sezgin, and A. Sokolova. Quantitative relaxation of concurrent data structures. In Proc. POPL. ACM, 2013. 
[4] A. Haas, T.A. Henzinger, C.M. Kirsch, M. Lippautz, H. Payer, A. Sezgin, and A. Sokolova. Distributed queues in shared memory: Multicore performance and scalability through quantitative relaxation. In Proc. CF. ACM, 2013.

Title: Adaptive Transactional Memories: Performance and Energy Consumption Tradeoffs
Authors: Diego Rughetti, Pierangelo Di Sanzo and Alessandro Pellegrin
Abstract: Energy efficiency is becoming a pressing issue, especially in large data centers where it entails, at the same time, a non-negligible management cost, an enhancement of hardware fault probability, and a significant environmental footprint. In this paper, we study how Software Transactional Memories (STM) can provide benefits on both power saving and the overall applications' execution performance. 
This is related to the fact that encapsulating shared-data accesses.
We have selected a set of self-adaptive extensions to existing STM middlewares (namely, TinySTM and R-STM) to prove how self-adapting computation can capture the actual degree of parallelism and/or logical contention on shared data in a better way, enhancing even more the intrinsic benefits provided by STM. Of course, this benefit comes at a cost, which is the actual execution time required by the proposed approaches to precisely tune the execution parameters for reducing power consumption and enhancing execution performance. Nevertheless, the results hereby provided show that adaptivity is a strictly necessary requirement to reduce energy consumption in STM systems: Without it, it is not possible to reach any acceptable level of energy efficiency at all.

Title: A Locality-Aware Software Transactional Memory
Authors: Patrick Marlier, Anita Sobe, Pierre Sutra
Abstract: The advent of chip level multiprocessing in commodity hardware has pushed applications to be more and more parallel in order to leverage the increase of computational power. However, the art of concurrent programming is known to be a difficult task, and new paradigms are required to help the programmer. Among those paradigms, software transactional memory (STM) is widely considered as a promising direction. The two key factors contributing to the popularity of STM are its simplicity and that STM is at heart a non-blocking technique.
The engine that orchestrates concurrent transactions run by the application, i.e., the concurrency manager, is one of the core aspects of a STM implementation design. A large number of concurrency manager implementations exists, ranging from pessimistic lock-based implementations to completely optimistic ones, with, or without, multi-version support. Because application workloads exhibit in general a high degree of parallelism, these designs tend to favor optimistic concurrency control. In particular, a widely accepted approach consists in executing tentatively the read operations and validating them on the course of the transaction execution to enforce consistency.
One of the very first STM design validates the read set at each step of the transaction, resulting in a quadratic-time validation complexity. More advanced techniques employ time-based validation. In a nutshell, this approach uses a global clock to track causality relations between transactions. When a transaction starts its execution, it retrieves a starting timestamp from the global clock, and re-validates its read set only if it encounters an object storing a higher timestamp. 
In every time-based STM, the critical path of a transaction contains at least two global operations. Global operations are expensive in multi-core/multiprocessors architecture, due to the synchronization wall. In particular, computer designs that exploit data locality with multiple cache levels, such as non-uniform memory architectures (NUMA), have to reduce the amount of global operations to improve application performance.
A few solutions exist to relieve time-based STM implementation from the usage of a global clock. However, to the best of our knowledge, they either increase storage cost, execute a high number of invalidations, or ensure weak consistency criteria. This talk addresses these shortcomings with a novel flexible STM design. Our implementation supports invisible reads, lazy snapshots, and it can be tailored to leverage data locality (from disjoint access parallelism to NUMA-awareness). Our design aims at maximizing the native workload parallelism, while leveraging locality of computer architectures to reduce the number of invalidations. Several experiments assess that our locality-aware design is in favorable cases close to the optimum, and that in the other ones, its performance are similar to existing STM solutions.