Category: SAS

Soner Onder and Dave Whalley Investigate Instruction-level Parallelism

From Florida State University News

A Florida State University researcher is working to make computer processors execute applications in a more energy-efficient manner with the help of a new $1.2 million grant from the National Science Foundation.

Professor Dave Whalley, Florida State University

“The general goal is to increase performance but to do it in a manner that is more energy efficient than the dominant computer processors that are in use today,” Professor of Computer Science David Whalley said.

To do that, Whalley and his colleague Soner Onder, a professor at Michigan Technological University, hope to more efficiently exploit what’s called instruction-level parallelism, or the ability of a computer to simultaneously execute multiple machine instructions.

Professor Soner Onder, Michigan Tech Department of Computer Science
Professor Soner Onder, Michigan Tech Department of Computer Science

“In general, VLIW processors are more energy efficient but cannot approach the performance of OoO processors except in limited domains, such as digital signal processing,” Whalley said.

Whalley’s project, called SCALE for Statically Controlled Asynchronous Lane Execution, is designed to overcome these current limitations. SCALE supports separate execution lanes, so that instructions in separate lanes can execute in parallel and dependencies between instructions in different lanes are identified by the compiler to synchronize these lanes when necessary.

“Providing distinct lanes of instructions allows the compiler to generate code for different modes of execution to adapt to the type of parallelism that is available at each point within an application,” Whalley said.

The grant began this fall and will run through August 2023. Half of the funding will come to Florida State, with the other half supporting Onder’s part of the work at Michigan Technological University. The FSU portion will support two graduate students in computer science.

Soner Onder Receives Year One Funding for $1.2M NSF SCALE Project

Soner Onder

Dave Whalley

Soner Onder, professor of computer science, was recently awarded $246,329 for the first year of a four-year NSF grant for his project, “SHF: Medium: Collaborative Research: Statically Controlled Asynchronous Lane Execution (SCALE).” The project is in collaboration with Prof. David Whalley of Florida State University. Michigan Tech is the lead institution in the project, it is expected to total $1.2 million, with Michigan Tech receiving $600,000.

Abstract: Enabling better performing systems benefits applications that span those running on mobile devices to large data applications running on data centers. The efficiency of most applications is still primarily affected by single thread performance. Instruction-level parallelism (ILP) speeds up programs by executing instructions of the program in parallel, with ‘superscalar’ processors achieving maximum performance. At the same time, energy efficiency is a key criteria to keep in mind as such speedup happens, with these two being conflicting criteria in system design. This project develops a Statically Controlled Asynchronous Lane Execution (SCALE) approach that has the potential to meet or exceed the performance of a traditional superscalar processor while approaching the energy efficiency of a very long instruction word (VLIW) processor. As implied by its name, the SCALE approach has the ability to scale to different types and levels of parallelism. The toolset and designs developed in this project will be available as open-source and will also have an impact on both education and research. The SCALE architectural and compiler techniques will be included in undergraduate and graduate curricula.

The SCALE approach supports separate asynchronous execution lanes where dependencies between instructions in different lanes are statically identified by the compiler to provide inter-lane synchronization. Providing distinct lanes of instructions allows the compiler to generate code for different modes of execution to adapt to the type of parallelism that is available at each point within an application. These execution modes include explicit packaging of parallel instructions, parallel and pipelined execution of loop iterations, single program multiple data (SPMD) execution, and independent multi-threading.

This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.

https://www.nsf.gov/awardsearch/showAward?AWD_ID=1901005&HistoricalAwards=false

Soner Onder Presents Keynote at SAMOS XIX

Soner Onder
Soner Onder

Soner Onder (SAS), professor of computer science, presented a keynote lecture July 8, 2019, at the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XIX) on Samos Island, Greece, which was held July 7-11. Onder’s talk was titled, “Form Follows Function: The Case for Homogeneous Computer Architectures.” Onder also participated in the conference’s “The Annual Open Mike Panel.”

Keynote Lecture Abstract: ”Form follows function” is a principle associated with 20th-century modernist architecture and industrial design which says that the shape of a building or object should primarily relate to its intended function or purpose”[2]. For best performance in computer architecture, form must follow function as well. What are form and function in computer architecture? Form is easy to understand and interpret in its dictionary meaning; Function is not so clear-cut. In this talk, I will start with a simple problem, an algorithm, and a basic program representation that will be interpreted by the machine, and show that delivering high performance rests on solving only a handful, but fundamentally difficult problems. I will argue that the mere existence of domain specific solutions that general purpose computing cannot match in performance is a testament that the general purpose computing is ”not general enough”. What makes an architecture ”not general enough” is not the architecture itself, but rather the mismatch between the function its form had followed and the actual semantics of programs. To illustrate the point, I will challenge the widely understood interpretation of instruction-level parallelism (ILP) as ”single-thread performance”, and show that this interpretation is too short-sighted. We can efficiently exploit all types of available parallelism, including process-level, thread-level and data level parallelism, all at the instruction-level, and this approach is both feasible and necessary to combat the complexity that is plaguing our profession. I will then discuss why an executable single-assignment program representation [1] may be the ultimate function whose implementations may result in homogeneous general purpose architectures that can potentially match the performance of accelerators for specific tasks, while exceeding the performance of any accelerator traditional architecture combination for general tasks. I will conclude by discussing our results with Demand-driven Execution (DDE), whose form follows this single-assignment program representation.

About SAMOS (from http://samos-conference.com/): SAMOS is a unique conference. It deals with embedded systems (sort of) but that is not what makes it different. It brings together every year researchers from both academia and industry on the quiet and inspiring northern mountainside of the Mediterranean island of Samos, which in itself is different. But more importantly, it really fosters collaboration rather than competition. Formal and intensive technical sessions are only held in the mornings.A lively panel or distinguished keynote speaker ends the formal part of the day, and leads nicely into the afternoons and evenings — reserved for informal discussions, good food, and the inviting Aegean Sea. The conference papers will be published by Springer’s Lecture Notes in Computer Science – LNCS and will be included in the DBLP Database.

Samos Island, Greece

Samos Island, Greece

Soner Onder Presents Talk in Barcelona, Spain

Soner Onder is pictured on the right in the front.

Sonder Onder (SAS), professor of computer science, presented an invited talk at “Yale:80: Pushing the Envelope of Computing for the Future,” held July 1-2, 2019, in Barcelona, Spain. The workshop was organized by Universitat Politècnica de Catalunya in honor of the 80th birthday of Yale Patt, a prominent computer architecture researcher. Onder was one of 23 invitees to give a talk. His lecture was titled, “Program semantics meets architecture: What if we did not have branches?”

View the slides from Onder’s talk: Yale80-in-2019-Soner-Onder

Yale Patt is a professor in the Department of Electrical & Computer Engineering at The University of Texas at Austin, where he holds the Ernest Cockrell, Jr. Centennial Chair in Engineering. He also holds the title of University Distinguished Teaching Professor. Patt was elected to the National Academy of Engineering in 2014, among the highest professional distinctions bestowed upon an engineer. View Patt’s faculty webpage at: http://www.ece.utexas.edu/people/faculty/yale-patt.

Link to the workshop’s website here: http://research.ac.upc.edu/80-in-2019/

Visit the workshop’s Facebook page here: https://www.facebook.com/BSCCNS/posts/workshop-yale-80-in-2019pushing-the-envelope-of-computing-for-the-futurehttprese/2217508564992996/

Soner Onder, Barcelona, Spain

Soner Onder at Sagrada Família, Barcelona, Spain

Ali Ebnenasir is Co-author of Article in ACM Transactions on Computational Logic

Ali EbnenasirAli Ebnenasir (SAS/CS), professor of computer science, is co-author of the article, “On the verification of livelock-freedom and self-stabilization on parameterized rings,” published in the July 2019 issue of the journal ACM Transactions on Computational Logic. The article is co-authored by Alex Klinkhamer of Google.

Abstract: This article investigates the verification of livelock-freedom and self-stabilization on parameterized rings consisting of symmetric, constant space, deterministic, and self-disabling processes. The results of this article have a significant impact on several fields, including scalable distributed systems, resilient and self-* systems, and verification of parameterized systems. First, we identify necessary and sufficient local conditions for the existence of global livelocks in parameterized unidirectional rings with unbounded (but finite) number of processes under the interleaving semantics. Using a reduction from the periodic domino problem, we show that, in general, verifying livelock-freedom of parameterized unidirectional rings is undecidable (specifically, Π10-complete) even for constant space, deterministic, and self-disabling processes. This result implies that verifying self-stabilization for parameterized rings of self-disabling processes is also undecidable. We also show that verifying livelock-freedom and self-stabilization remain undecidable under (1) synchronous execution semantics, (2) the FIFO consistency model, and (3) any scheduling policy. We then present a new scope-based method for detecting and constructing livelocks in parameterized rings. The proposed semi-algorithm behind our scope-based verification is based on a novel paradigm for the detection of livelocks that totally circumvents state space exploration. Our experimental results on an implementation of the proposed semi-algorithm are very promising as we have found livelocks in parameterized rings in a few microseconds on a regular laptop. The results of this article have significant implications for scalable distributed systems with cyclic topologies.

Citation: Klinkhamer, A., & Ebnenasir, A. (2019). On the verification of livelock-freedom and self-stabilization on parameterized rings. ACM Transactions on Computational Logic, 20(3), 16:1-16:36. http://dx.doi.org/10.1145/3326456

MTU Digital Commons link: https://digitalcommons.mtu.edu/michigantech-p/146/

ACM link: https://dl.acm.org/citation.cfm?doid=3338853.3326456

Saeid Nooshabadi is Co-author of Article in Journal of Parallel and Distributed Computing

Saeid Nooshabadi

Saeid Nooshabadi (SAS/ECE), professor of electrical and computer engineering, is co-author of the article, “High-dimensional image descriptor matching using highly parallel KD-tree construction and approximate nearest neighbor search,” to be published in  the October 2019 issue of the Journal of Parallel and Distributed Computing, which is published by Elsevier. The article is co-authored by Michigan Tech Computer Science department doctoral candidate Linjia Hu.

Abstract: To overcome the high computational cost associated with the high-dimensional digital image descriptor matching, this paper presents a set of integrated parallel algorithms for the construction of K-dimensional tree (KD-tree) and P approximate nearest neighbor search (P-ANNS) on the modern massively parallel architectures (MPA). To improve the runtime performance of the P-ANNS, we propose an efficient sliding window for a parallel buffered P-ANNS on KD-tree to mitigate the high cost of global memory accesses. When applied to high dimensional real-world image descriptor datasets, the proposed KD-tree construction and the buffered P-ANNS algorithms are of comparable matching quality to the traditional sequential counterparts on CPU, while outperforming their serial CPU counterparts by speedup factors of up to 17 and 163, respectively. The algorithms are also studied for the performance impact factors to obtain the optimal runtime configurations for various datasets. Moreover, we verify the features of the parallel algorithms on typical 3D image matching scenarios. With the classical local image descriptor signature of histograms of orientations (SHOT) datasets, the parallel KD-tree construction and image descriptor matching can achieve up to 11 and 138-fold speedups, respectively.

Citation: Hu, L., & Nooshabadi, S. (2019). High-dimensional image descriptor matching using highly parallel KD-tree construction and approximate nearest neighbor search. Journal of Parallel and Distributed Computing, 132, 127-140. http://dx.doi.org/10.1016/j.jpdc.2019.06.003

MTU Digital Commons link: https://digitalcommons.mtu.edu/michigantech-p/145/

Elsevier link: https://www.sciencedirect.com/science/article/pii/S0743731519304319?via%3Dihub

Zhenlin Wang is Co-Author of Article in Parallel Programming Journal

Zhenlin Wang (SAS), professor of computer science, is co-author of the article, “Lightweight and accurate memory allocation in key-value cache,” published in the June 2019 issue of the International Journal of Parallel Programming, which is published by Springer.

Abstract: The use of key-value caches in modern web servers is becoming more and more ubiquitous. Representatively, Memcached as a widely used key-value cache system, originally intended for speeding up dynamic web applications by alleviating database load. One of the key factors affecting the performance of Memcached is the memory allocation among different item classes. How to obtain the most efficient partitioning scheme with low time and space consumption is a focus of attention. In this paper, we propose a lightweight and accurate memory allocation scheme in Memcached, by sampling access patterns, analyzing data locality, and reassigning the memory space. One early study on optimizing memory allocation is LAMA, which uses footprint-based MRC to optimize memory allocation in Memcached. However, LAMA does not model deletion operations in Memcached and its spatial overhead is quite large. We propose a method that consumes only 3% of LAMA space and can handle read, write and deletion operations. Moreover, evaluation results show that the average stable-state miss ratio is reduced by 15.0% and the average stable-state response time is reduced by 12.3% when comparing our method to LAMA.

Citation: Pan, C., Zhou, L., Luo, Y., Wang, X., & Wang, Z. (2019). Lightweight and accurate memory allocation in key-value cache. International Journal of Parallel Programming, 47(3), 451-466.http://dx.doi.org/10.1007/s10766-018-0616-4

Digital Commons link: https://digitalcommons.mtu.edu/michigantech-p/144/

Springer link: https://link.springer.com/article/10.1007%2Fs10766-018-0616-4

Ali Ebnenasir is Co-Author of Publication in ACM Transactions on Computational Logic

Ali Ebnenasir

An article co-authored by Ali Ebnenasir (SAS/CS) and Alex Klinkhamer, “Verification of Livelock-Freedom and Self-Stabilization on Parameterized Rings,” was recently published in ACM Transactions on Computational Logic.

Abstract: This article investigates the verification of livelock-freedom and self-stabilization on parameterized rings consisting of symmetric, constant space, deterministic, and self-disabling processes. The results of this article have a significant impact on several fields, including scalable distributed systems, resilient and self-* systems, and verification of parameterized systems. First, we identify necessary and sufficient local conditions for the existence of global livelocks in parameterized unidirectional rings with unbounded (but finite) number of processes under the interleaving semantics. Using a reduction from the periodic domino problem, we show that, in general, verifying livelock-freedom of parameterized unidirectional rings is undecidable (specifically, Π10-complete) even for constant space, deterministic, and self-disabling processes. This result implies that verifying self-stabilization for parameterized rings of self-disabling processes is also undecidable. We also show that verifying livelock-freedom and self-stabilization remain undecidable under (1) synchronous execution semantics, (2) the FIFO consistency model, and (3) any scheduling policy. We then present a new scope-based method for detecting and constructing livelocks in parameterized rings. The proposed semi-algorithm behind our scope-based verification is based on a novel paradigm for the detection of livelocks that totally circumvents state space exploration. Our experimental results on an implementation of the proposed semi-algorithm are very promising as we have found livelocks in parameterized rings in a few microseconds on a regular laptop. The results of this article have significant implications for scalable distributed systems with cyclic topologies.

https://dl.acm.org/citation.cfm?id=3326456&dl=ACM&coll=DL

doi: 10.1145/3326456

Scalable Spectral Sparsification of Graph Laplacians and Integrated Circuits

Circuit board

Researcher: Zhuo Feng, Associate Professor, Electrical and Computer Engineering

Sponsor: National Science Foundation: SHF: Small

Amount of Support: $450,000

Duration of Support: 3 years

Abstract: This research is motivated by investigations on scalable methods for design simplifications of nanoscale integrated circuits (ICs). This is to be achieved by extending the associated spectral graph sparsification framework to handle Laplacian-like matrices derived from general nonlinear IC modeling and simulation problems. The results from this research may prove to be key to the development of highly scalable computer-aided design algorithms for modeling, simulation, design, optimization, as well as verification of future nanoscale ICs that can easily involve multi-billions of circuit components. The algorithms and methodologies developed will be disseminated to leading technology companies that may include semiconductor and Electronic Design Automation companies as well as social and network companies, for potential industrial deployments.

Spectral graph sparsification aims to find an ultra-sparse subgraph (a.k.a. sparsifier) such that its Laplacian can well approximate the original one in terms of its eigenvalues and eigenvectors. Since spectrally similar subgraphs can approximately preserve the distances, much faster numerical and graph-based algorithms can be developed based on these “spectrally” sparsified networks. A nearly-linear complexity spectral graph sparsification algorithm is to be developed based on a spectral perturbation approach. The proposed method is highly scalable and thus can be immediately leveraged for the development of nearly-linear time sparse matrix solvers and spectral graph (data) partitioning (clustering) algorithms for large real-world graph problems in general. The results of the research may also influence a broad range of computer science and engineering problems related to complex system/network modeling, numerical linear algebra, optimization, machine learning, computational fluid dynamics, transportation and social networks, etc.

More details.