This is a short description of major projects I led or participated in. Please refer to my publications page for details.
Speculative Lock Allocation for Chip Multiprocessors (LBNL): This project attempts to exploit the sharing patterns of cache lines and locks in modern chip multiprocessors to assist data movement. It attacks performance bottlenecks created by heavily contended cache lines and locks, which will become even more critical with the simple cores projected for future many-core chips.
TiDA: Tiling as a Durable Abstraction (LBNL): Tiling is a useful loop transformation for expressing parallelism and data locality. Automated tiling transformations that preserve data locality are increasingly important due to hardware trends towards massive parallelism and the rising cost of data movement relative to the cost of computation. This work develops TiDA as a durable tiling abstraction that centralizes parameterized tiling information within array data types with minimal changes to the source code. The data layout information can be used by the compiler and runtime to automatically manage parallelism, optimize data locality, and schedule tasks intelligently.
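To illustrate the transformation that TiDA parameterizes, here is a minimal sketch of loop tiling in Python. The tile sizes and function names are hypothetical, chosen for illustration; TiDA itself embeds this information in the array types rather than in the loop nest.

```python
# A minimal sketch of loop tiling for data locality. Tile sizes here are
# illustrative parameters, not TiDA's actual interface.

def tiled_sum(grid, tile_rows, tile_cols):
    """Visit a 2-D grid tile by tile so each tile stays cache-resident."""
    n_rows, n_cols = len(grid), len(grid[0])
    total = 0.0
    for tr in range(0, n_rows, tile_rows):        # loop over tiles
        for tc in range(0, n_cols, tile_cols):
            for i in range(tr, min(tr + tile_rows, n_rows)):   # within a tile
                for j in range(tc, min(tc + tile_cols, n_cols)):
                    total += grid[i][j]
    return total

grid = [[i * 4 + j for j in range(4)] for i in range(4)]  # values 0..15
print(tiled_sum(grid, 2, 2))   # 120.0 — same result as an untiled sum
```

The point of the abstraction is that the tile sizes are tunable parameters: changing them reshapes the traversal for a given cache or scratchpad without touching the loop body.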
Collective memory transfers for multi-core chips (LBNL): Performance improvements for a vast number of parallel applications depend on efficient access, by a collection of on-chip cores, of neighboring data stored in main memory. A prime example is the broad class of data-parallel applications and kernels, which require domain decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 data caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores, and DRAM performance and power are both crucial for a wide variety of chips. This project proposes collective memory scheduling (CMS), which actively takes control of collective memory transfers so that requests arrive at the memory controller in a sequential and predictable fashion. CMS maximizes memory throughput and reduces power beyond what even the most sophisticated memory controllers can attain. The mechanism can be implemented in hardware-managed cache-coherent chip multiprocessors in the form of a last-level cache prefetcher, or with software assistance in architectures with local stores using DMA operations. Paper.
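The core idea can be sketched in a few lines: independent cores fetching their own tiles produce an interleaved, non-streaming address sequence, while a collective scheduler can present the same requests to DRAM in sequential order. This toy model is my own illustration; the request format and scheduling policy are stand-ins, not the mechanism from the paper.

```python
# Toy model of the idea behind collective memory scheduling (CMS):
# interleaved per-core requests are collectively reordered into a
# streaming, sequential address pattern before reaching DRAM.

def interleaved_requests(num_cores, lines_per_core):
    """Round-robin arrival order (core 0, core 1, ...) — non-streaming."""
    reqs = []
    for line in range(lines_per_core):
        for core in range(num_cores):
            reqs.append(core * lines_per_core + line)  # address in core's tile
    return reqs

def cms_schedule(requests):
    """Collectively reorder so addresses hit DRAM sequentially."""
    return sorted(requests)

arrived = interleaved_requests(num_cores=4, lines_per_core=3)
print(arrived)                 # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
print(cms_schedule(arrived))   # [0, 1, 2, ..., 11] — a streaming pattern
```

A real memory controller sees only a bounded window of requests, which is why orchestrating the transfer collectively at the source can outperform even sophisticated in-controller reordering.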
Variable-Width Datapath for On-Chip Network Static Power Reduction (LBNL): With the tight power budgets in modern large-scale chips and the unpredictability of application traffic, on-chip network designers are faced with the dilemma of designing for worst-case bandwidth demands and incurring high static power overheads, or designing for an average traffic pattern and risking degraded performance. This work proposes adaptive bandwidth networks (ABNs), which reduce the static power of on-chip networks by dividing channels, buffers, and crossbars into lanes and activating at each hop only the number of lanes necessary to meet traffic demands. ABNs also take advantage of drowsy SRAMs to eliminate false input VC buffer activations. In addition, ABNs readily apply to silicon defect tolerance with only the added cost of fault detection. Technical Report.
Extending Summation Precision for Network Reduction Operations (LBNL): Computational precision in summation is at the core of numerous important algorithms, such as Newton-Krylov methods and other operations involving inner products. To avoid precision loss and the accumulation of rounding errors in operations with millions or billions of operands, in this work I propose big integer (BigInt) expansions of double-precision variables that enable arbitrarily large summations without error and provide exact and reproducible results for distributed (system-wide) summations. By including simple and inexpensive logic in modern NICs, this is feasible with performance comparable to that of double-precision floating-point summation and without the performance or cost penalties of sorting algorithms or complex data structures. Paper.
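The principle behind the BigInt expansion can be demonstrated in software: map each double to an exact integer/rational representation, accumulate without rounding, and round only once at the end, which makes the result independent of summation order. The sketch below uses Python's exact `Fraction` type to play the role of the wide fixed-point accumulator that the paper puts in NIC hardware; it shows the idea, not the hardware design.

```python
# Sketch of error-free, reproducible summation: each double converts
# exactly to a rational, accumulation never rounds, and a single
# rounding happens at the end. (The paper uses wide fixed-point
# integers in the NIC; Fraction stands in for that accumulator here.)

from fractions import Fraction

def exact_sum(values):
    """Order-independent, error-free summation of doubles."""
    acc = Fraction(0)
    for v in values:
        acc += Fraction(v)      # Fraction(float) is exact — no rounding
    return float(acc)           # one rounding, at the very end

vals = [1e16, 1.0, -1e16]       # naive left-to-right summation loses the 1.0
print(sum(vals))                # 0.0 — the 1.0 was absorbed by 1e16
print(exact_sum(vals))          # 1.0 — exact
print(exact_sum(reversed(vals)))  # 1.0 — same result in any order
```

Order independence is exactly what makes distributed reductions reproducible: no matter how the network tree combines partial sums, the final rounded value is identical.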
OpenSoC: a Flexible, Parameterizable, Open NoC Generation Tool (LBNL): With future technology scaling, we anticipate many-core chips and large-scale systems to play a crucial role in applications ranging from HPC and big data down to embedded computing. However, the architectures we expect in the future are hard to simulate or emulate today: detailed or cycle-accurate software simulation can be very slow, whereas hardware emulation requires an often excessively long development time and offers limited internal visibility. This project is developing a highly configurable open-source system-on-chip (SoC) platform that provides both software and hardware models by using the CHISEL language, which offers both behavioral (software) simulation and hardware emulation from the same code base. This infrastructure will be freely available for research as well as manufacturing. Currently the project focuses on the on-chip network, but will later expand to the complete SoC to include cores, memory controllers, and other IP blocks.
Scalability of cache coherency to thousands of cores (LBNL): Still in its early stages, this project aims to answer what form of cache coherency is prudent for future 1024-processor chip multiprocessors. If no current hardware, software, or hybrid scheme proves optimal, we will investigate the bottlenecks and cost tradeoffs and develop new protocols. The project on rapid evaluation of future large-scale chips (described above) will supply a solid infrastructure for gathering the concrete and detailed data needed to complete this study.
Channel Reservation Protocol for Over-Subscribed Channels and Destinations (Stanford): The channel reservation protocol (CRP) focuses on system-wide networks, such as supercomputer networks. Such networks are often oversubscribed. To make matters worse, workloads in such networks can overstress destinations, creating hotspots. The most popular congestion control technique in today's networks is explicit congestion notification (ECN). However, ECN only acts after congestion has formed and thus reacts slowly to changes in the traffic pattern. Furthermore, ECN's configuration parameters are overly sensitive to the traffic pattern and network design. To mitigate these problems, CRP prevents congestion from ever occurring by allowing sources to reserve bandwidth in multiple resources with a single request, while avoiding the resource idling of circuit switching. Furthermore, CRP prevents congestion in traffic patterns where ECN is ineffective, such as congestion formed in network channels by short-lived flows generated by a large combination of source-destination pairs. This way, a benign flow is not affected by tree saturation caused by an adversarial flow. CRP is an extension of the speculative reservation protocol (SRP), because SRP does not reserve bandwidth in network channels. Paper.
Elastic buffer flow control for on-chip networks (Stanford): This is my longest-running project and a significant part of my PhD thesis. This work proposes elastic buffer flow control, which uses the on-chip channel pipeline flip-flops for buffering instead of router buffers, by adding a control logic block that drives the master and slave latch enable inputs independently. This improves performance per unit cost and also significantly simplifies the network, thereby reducing cycle time. I have also designed a two-cycle router, which minimizes cycle time, and a single-cycle router, which minimizes latency while still being clocked faster than a virtual channel router. Elastic buffer networks use physical channels instead of virtual channels for traffic separation. I also propose a mechanism to sense congestion, since credits are no longer used. In short, elastic buffering lets networks remove router buffers while preserving buffering in the channels. Main paper.
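The behavior of one elastic buffer (EB) stage can be modeled in a few lines: because the master and slave latches are enabled independently, a single pipeline stage can hold up to two flits and exert backpressure on its own. The model below is my illustrative software abstraction of that ready/valid behavior, not the control logic from the paper.

```python
# Toy model of one elastic buffer channel stage. With independently
# enabled master/slave latches, the stage stores up to two flits and
# applies backpressure without separate router buffers.

class ElasticStage:
    def __init__(self):
        self.slots = []            # up to 2 flits (slave slot, master slot)

    @property
    def ready(self):               # can accept a flit from upstream?
        return len(self.slots) < 2

    @property
    def valid(self):               # has a flit to offer downstream?
        return len(self.slots) > 0

    def push(self, flit):
        assert self.ready          # upstream must honor backpressure
        self.slots.append(flit)

    def pop(self):
        assert self.valid
        return self.slots.pop(0)   # flits leave in FIFO order

stage = ElasticStage()
stage.push("A"); stage.push("B")   # both latches now occupied
print(stage.ready)                 # False — backpressure to upstream
print(stage.pop(), stage.pop())    # A B — order preserved
```

Chaining such stages along a channel yields distributed FIFO buffering in the wires themselves, which is why the routers can shed their input buffers.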
Evaluating bufferless flow control for on-chip networks (Stanford): In this work I investigate bufferless flow control and compare it against an optimized buffered on-chip network. As shown, the allocator required by bufferless flow control contains a long serial path because all flits must be sent to an output port without any cycle of delay. Moreover, the lack of backpressure means that very long buffers or additional mechanisms are required to prevent deadlocks or packet drops at the endpoints. Also, buffers can be implemented as custom SRAMs, which considerably reduces their costs, especially occupied area. Furthermore, this work shows that with buffer bypassing, bufferless flow control provides only 1.5% energy savings at best while still incurring the side effects outlined in this work. This paper was a best paper award candidate and was very well received and cited; it addressed an important debate in the on-chip network community at the time. Paper.
Packet chaining: Efficient single-cycle allocation for on-chip networks (Stanford): Packet chaining improves the matching efficiency of separable allocators without extending cycle time. It chains short packets together to reuse the connections of departing packets. This allows an allocator to build up an efficient matching over a number of cycles, much like incremental allocation, but without being limited by packet length. Packet chaining uses a separate allocator that locates packets eligible to reuse the connections of departing packets. It provides considerable performance gains and enables separable allocators to deliver performance comparable to or higher than more expensive and slower allocators. This is important because short packets often dominate chip multiprocessor traffic, and many networks are constrained by cycle time or cannot afford more complex allocators. Paper.
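The eligibility check at the heart of the idea is simple to sketch: when a packet departs, a waiting packet at the same input and headed to the same output can inherit the established connection instead of re-arbitrating. The names and data layout below are illustrative only; the paper's chaining allocator is considerably more involved.

```python
# Toy illustration of packet chaining eligibility: waiting packets that
# match a departing (input, output) connection inherit it and skip the
# main allocation stage. Illustrative structures, not the paper's design.

def chainable(departing, waiting):
    """Pick waiting packets that can inherit a departing connection."""
    chained = []
    for conn in departing:                 # conn = (input_port, output_port)
        for pkt in waiting:
            if pkt["in"] == conn[0] and pkt["out"] == conn[1]:
                chained.append(pkt["id"])  # reuse the connection as-is
                waiting.remove(pkt)
                break                      # one heir per connection
    return chained

departing = [(0, 2), (1, 3)]
waiting = [{"id": "p5", "in": 0, "out": 2},    # matches connection (0, 2)
           {"id": "p6", "in": 1, "out": 0}]    # no departing match
print(chainable(departing, waiting))            # ['p5']
```

Because inherited connections bypass arbitration entirely, the matching quality accumulates across cycles even when every packet is only a flit or two long.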
An analysis of on-chip networks for large-scale chip multiprocessors (Stanford): This work is an early attempt to co-design an on-chip network with the cache hierarchy in a chip multiprocessor. It compares different network topologies by simulating various networks and system parameters, and analyzing the implications for applications and cache parameters. It also uses a cost model to estimate the network's area and power costs relative to the system. As part of this study, I examine the effect that system-level parameters such as L2 cache sharing, size, and associativity have on network usage and performance. This work also evaluates a hierarchical network design where a small but variable number of processing cores and L2 cache slices are grouped into one tile with a local interconnect, and tiles are connected with a global interconnect. Paper.
Adaptive backpressure: Efficient buffer management for on-chip networks (Stanford): This work proposes a novel scheme that improves the utilization of dynamically managed router input buffers by continuously adjusting the stiffness of the flow-control feedback loop in response to observed traffic conditions. This minimizes the amount of buffer space occupied unproductively by stalled packets, leads to a more efficient distribution of buffer space, and improves isolation between multiple concurrently executing workloads with differing performance characteristics. Paper.
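As a rough intuition for "adjusting the stiffness of the feedback loop", consider a credit quota on a shared input buffer that shrinks when packets stall and relaxes when they flow. The adjustment rule and parameter names below are my own simplified stand-ins, not the policy from the paper.

```python
# Hypothetical sketch of adaptive backpressure intuition: shrink a
# flow's credit quota in a shared buffer when its packets stall, so
# stalled traffic cannot hoard buffer space; relax when traffic flows.

def adjust_quota(quota, stalled_cycles, max_quota, min_quota=1):
    """One adjustment step of the (illustrative) quota controller."""
    if stalled_cycles > 0:
        return max(min_quota, quota - 1)   # stalls seen: stiffen backpressure
    return min(max_quota, quota + 1)       # traffic flowing: relax it

quota = 8
quota = adjust_quota(quota, stalled_cycles=5, max_quota=8)
print(quota)   # 7 — one fewer buffer slot tied up by the stalled flow
```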
Network congestion avoidance through speculative reservation (Stanford): The speculative reservation protocol (SRP) prevents tree saturation from hotspot destinations by having sources reserve bandwidth from their destinations in advance of packet submission. This way, benign flows are not affected by other flows causing the hotspot. SRP uses speculative packets which are sent without waiting for a reservation, but are dropped in case of contention. Paper.
IPIF to PCI bridge (undergraduate degree thesis at UOC): I designed and implemented a fully functional bridge interconnecting the IPIF and PCI buses. The bridge complies with both bus standards and is able to initiate and handle all types of transactions. I also developed a comprehensive behavioral testbench, and connected a PowerPC to the IPIF bus as the final testing step. I performed behavioral, post-synthesis, and post-place-and-route verification simulations using Xilinx ISE, Xilinx EDK, ModelSim, and NC-Launch. This project was later used by the company Ellemedia in one of their commercial projects. Comprehensive documentation can be found here.
Switch Network Interface Card (UOC): As part of a large project within my research group, and in cooperation with George Kalokerinos (hardware engineer), I developed a fully functional PCI-X network interface card. The purpose of this project was to enable communication between a PC and a buffered crossbar switching fabric for computer interconnection networks developed by our group. The PCI-X NIC was tested and verified both standalone and connected to the rest of the project, using Xilinx FPGAs. My contribution included writing parts of the accompanying technical report and of the research paper submitted to the HiPEAC 2007 conference, which was graded 4/5 overall but not published. This paper was also submitted to and published at HiPEAC's 2nd Industrial Workshop, held in Eindhoven on October 17th, 2006. For behavioral verification, I developed a complete PCI-X testbench, which is now maintained as a separate project for future use. Documentation for this project can be found here.
Approaching Ideal NoC Latency with Pre-Configured Routes (UOC): For my MSc thesis I investigated and proposed the design of a low-latency network-on-chip. I was the only one working on this project, always in cooperation with my supervisor, Prof. Katevenis. This network-on-chip offers performance closely comparable to that of long wires, and is based on the simple observation that since long wires include buffer elements anyway, we can replace these buffers with primitive multiplexer cells at a minimal latency penalty. This is only feasible if these multiplexer cells have been preconfigured. We thus define run-time reconfigurable preferred paths within the network. On those paths a hop imposes a latency of only approximately 500 ps, independent of the clock cycle. Compare this to the 2 ns of one clock cycle per node at 500 MHz, which is a typical latency in related work. Non-preferred paths offer what related work offers at best: one clock cycle per node at 667 MHz (typical case) or 400 MHz (worst case). This NoC resurrects “mad postman”, an idea present in earlier literature on inter-processor networks before NoCs emerged as a major research area. As part of our research, we revised and adapted “mad postman” for use in our proposed NoC. This work includes other novelties as well, such as the implementation of a buffered crossbar-like switching node, in contrast to the vast majority of related work. We also investigate a customized topology for CMPs. We have published a paper at the 1st IEEE NoC symposium. As the first step in our research, we ran post-P&R simulations with various library cells, which confirmed our belief that this research attempt is certainly worthwhile. We published a paper at HiPEAC's 1st industrial workshop containing these preliminary observations, ideas, and supporting results. Complete documentation and presentation of this work can be found in my master's thesis or the ICS-FORTH technical report.
Below I briefly present one software project (program) I carried out during my undergraduate years in my free time, before any of the other projects presented on this page:
I developed a userspace encryption program named Turboencoder. This project was developed purely as a hobby, as I was interested in encryption algorithms and anti-cracking techniques. Through this project I was able to develop my own encryption algorithm, utilize well-known encryption standards, familiarize myself with anti-cracking techniques, implement and improve the most interesting ones, and gain experience with C++ and Win32 programming. Turboencoder is listed on many freeware websites, and was only recently removed from tucows, where it remained for 2 years and 4 months with a 4-cow rating. More information can be found at: http://users.forthnet.gr/ath/mihelog/1.htm.