The HBIR Language
=================

An HBIR program specifies a single computational kernel. The kernel is a computation that runs on a set of arbitrary-dimensional input arrays and produces a set of arrays as outputs. In other words, the program specifies the implementation of a function $f: (A_1,..,A_n) \rightarrow (B_1,..,B_m)$ where each $A_k$ and $B_k$ is a multidimensional array.

An HBIR program consists of four *segments*:

* **target**: The machine model. This segment describes the set-in-stone silicon resources of the target hardware. An instance of this segment would ship with a particular instance of the HammerBlade hardware.

* **config**: The system configuration. The configuration abstracts the physical hardware resources from `target` as a virtual machine model. Specifically, it organizes the machine into a set of *tile groups* $\\{\mathbb{T}_i\\}_{i \in \mathbb{I}}$ that will be programmed independently.

* **data**: Memory allocations. This segment describes how logical array inputs $A_k$ and outputs $B_k$ are mapped and partitioned across the tile groups $\mathbb{T}_i$ defined in `config`.

* **code**: The computation itself. This segment lists the SPMD code to execute on each tile group. The code consists of a set of programs, $g_i$, each destined to execute on the corresponding tile group, $\mathbb{T}_i$, defined in `config`. Each $g_i$ can read the inputs and write the outputs defined in `data`. Furthermore, the behavior of $g_i$ can optionally vary across $\mathbb{T}_i$'s available computational units, so it may be regarded as either MPMD or SPMD, depending on the particular use case. In other words, the implementation of $f$ is given by a set of $g_i: (j, A_1,..,A_n) \rightarrow (B_1,..,B_m)$, where $i \in \mathbb{I}$ specifies a particular tile group, $\mathbb{T}_i$, and $j \in \mathbb{T}_i$ specifies a particular computational unit in that tile group. $B_1,..,B_m$ will contain $f(A_1,..,A_n)$ once execution of all $g_i$ completes.

The following subsections describe the four segments of an HBIR program. A valid HBIR program must contain all four segments, in the order target, config, data, code; each segment may only reference the segments that precede it. The first three segments describe hardware configuration and memory mappings while the last segment describes high-level algorithmic details; hence the first three segments have very strict syntax and little support for extraneous operations.
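Concretely, the overall shape of an HBIR program can be pictured as the four segments written out in order, as in the sketch below; the segment bodies are elided here, and their contents are described in the subsections that follow:

    target {
        ...
    }

    config {
        ...
    }

    data {
        ...
    }

    code {
        ...
    }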
Target
------

The **target** segment expresses the static parameters of the HammerBlade hardware that the programmer would like to target. This corresponds to a specific instance of the HammerBlade architecture, which can support varying numbers of tiles and amounts of memory.

    target {
        memory g[4] {
            size 8G;
            width 8B;
        };
        ...
    }

The first part of the segment declares the shared, global memories that are available to the rest of the hardware. Programmers specify a name, the number of memories, a size, and an access width for the memory instance. The instance can then be referenced in other segments of an HBIR program. In the code snippet above, we declare four global memories named `g`, each with a size of 8GB and an access width of 8B. Other segments of an HBIR program can refer to `g` as `target.g` and can also access any of its fields. Referencing memory by raw physical sizes is rarely useful on its own, so memory is typically mapped to logical data structures for use in the code segment (described later).

    target {
        ...
        tile t[4][4] {
            memory l[4] {
                size 16K;
                width 8B;
            };
        };
    }

The next part of the segment declares the statically defined fabric in the target architecture. Programmers specify a name and the dimensions of the tile array. Compute resources (CGRA and CPU) are implicitly included in a tile declaration, but programmers can optionally specify local memory in the same manner that global memories are declared. In the code snippet above, a 4x4 tile array is declared, with each tile having four 16KB local memories, each with an access width of 8B. Similar to global memory, programmers can reference this memory through the tile's name. For example, `target.t.l.size` accesses the size of the local memory in the tile array. Notice that neither memory accesses nor tile accesses allow indexing into a specific resource: the tiles, like the global memory structures, are expected to be homogeneous.

    target {
        memory g[2] {
            size 8G;
            width 8B;
        };
        tile t[128][64] {
            memory l {
                size 64K;
                width 8B;
            };
        };
    }

The code snippet above models the architecture shown in Figure 1 by mapping the two L2 caches to global memories, each with 8GB of capacity and an 8B access width. The fabric has 128x64 tiles, each with a small 64KB local memory, also with an 8B access width.

Config
------

The **config** segment expresses the dynamic configuration of the hardware that best fits an application. This segment overlays a logical topology onto the physical fabric by grouping together resources defined in the target segment. Currently, groups only reference physical tiles and do not use any global or local memories defined in the target segment.

    config {
        group tg[4][4] {
            tile target.t[x][y];
        };
    }

To declare a group, programmers specify a name, the dimensions of the group, and how the group's tiles index into the physical tiles defined in the target segment. The most common group is just a fixed-size, flat grid of tiles from the tile array. Following the previous code snippets, the tile array is named `t`, and the programmer has specified a flat grid of 16 tiles from it. The variables `x` and `y` range over the group's own coordinates and are used to index into specific tiles in the array, which becomes necessary for more complicated topologies.

    config {
        group tg[target.t.x_max][target.t.y_max] {
            tile target.t[x][y];
        };
    }

Programmers can also reference `x_max` and `y_max` on the tile array to obtain the maximum dimensions of the declared array. This avoids hard-coding parameters in this segment: rather than statically sizing groups for one *specific* piece of hardware, the segment can adapt to a possibly changing target segment.

    config {
        group tga[2][4] {
            tile target.t[x][y];
        };
        group tgb[2][4] {
            tile target.t[x+2][y];
        };
    }

Programmers are also allowed to declare multiple groups in the segment. It is important that the indexing used by different groups does not map them onto overlapping tiles, and any compiler built for HBIR should check this at compile time. The code snippet above declares two groups, `tga` and `tgb`, both 2x4 rectangles, and maps them next to each other in the tile array.

    config {
        group grid[target.t.y_max] {
            group row[target.t.x_max] {
                tile target.t[grid.x][row.x];
            };
        };
    }

Finally, programmers can also nest groups. In the snippet above, the outer group `grid` contains a `row` subgroup for each of its entries, and the indices of the enclosing groups are available by name (here `grid.x` and `row.x`) when indexing into the physical tile array.
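These pieces can be combined. As an illustrative sketch, and assuming the 4x4 tile array `t` from the earlier target examples, a configuration for a pipelined application might dedicate a 1x4 slice of the fabric to a `producer` group and the remaining 3x4 tiles to a `consumer` group (the group names and the particular split are hypothetical):

    config {
        group producer[1][4] {
            tile target.t[x][y];
        };
        group consumer[3][4] {
            tile target.t[x+1][y];
        };
    }

As with `tga` and `tgb` above, the index offset in `consumer` keeps the two groups from claiming overlapping tiles.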
Data
----

The **data** segment expresses how logical data structures, currently vectors and arrays, map to physical memory, both global and local, declared in the target segment. The data structures declared here are used in the code segment to express the high-level algorithm/application.

    data {
        dim = 500;

        A: int[dim] = block[target.t.x_max][target.t.y_max] {
            target.g[x][y];
            chunked;
            host;
        };

        B: int[dim] = block[target.t.x_max][target.t.y_max] {
            target.g[x][y];
            chunked;
            host;
        };
        ...
    }

The programmer specifies a name, the type (`int`, `float`, `bool`), the dimensions of the data structure (i.e., a vector or array of a certain size), how it maps onto the hardware declared in the earlier segments, and several flags. Currently, the chunked, replicated, and striped flags allow programmers to specify how data should be distributed for the application. A host or device flag can also be set to specify whether the data is an input (host) or an output (device) of the algorithm. Simple constant declarations, such as `dim` above, are also allowed, since vectors and arrays often share the same dimensions; these constants must be declared before any memory mappings.

The code snippet above declares the input mappings for a vector-vector add application (i.e., `C = A+B`). `A` and `B` are declared as integer vectors of size 500 that map to the global memory `g` defined in the target segment. The host flag is set to indicate that these two arrays are inputs to the application and are populated by a host. The output is declared in the same way:

    data {
        ...
        C: int[dim] = block[target.t.x_max][target.t.y_max] {
            target.g[x][y];
            chunked;
            device;
        };
        ...
    }

`C` is declared as an integer vector of size 500 that also maps to the global memory `g` defined in the target segment. The device flag is set to indicate that the data structure is an output and will be populated by the application. Besides chunked, the replicated and striped flags change how a data structure's elements are distributed across the mapped memories.

Code
----

The **code** segment expresses the high-level application by tying it to groups defined in the config segment and using the logical data structures defined in the data segment. This operates similarly to CUDA, where threads, blocks, and grids determine the different execution contexts as well as the memory that is used. Programmers specify a group, along with indexing, and the code that should run on the tiles in that execution context. Programmers are also allowed to write simple code that lives outside of any group (global code).

    code {
        int g_done_flag = 0;

        config.tg[0][0] {
            bsg_finish();
        }

        config.tg[x][y] {
            for(int i = x+(y*x_max); i
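Completing the vector-vector add example, the per-tile loop might look like the sketch below. The loop bound, the stride, and the element-wise addition are illustrative assumptions about how the `dim`-element vectors `A`, `B`, and `C` from the data segment could be striped across the group; they are not taken from an HBIR reference program.

    code {
        ...
        config.tg[x][y] {
            // Hypothetical striping: tile (x, y) starts at its linear index
            // within the group and strides by the group size, so each element
            // of the dim-length vectors is handled by exactly one tile.
            for(int i = x+(y*x_max); i < dim; i += x_max*y_max) {
                C[i] = A[i] + B[i];   // element-wise C = A + B
            }
        }
    }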