The HBIR Language
=================

An HBIR program specifies a single computational kernel.
The kernel is a computation that runs on a set of arbitrary-dimensional input arrays and produces more arrays as outputs.
In other words, the program specifies the implementation of a function
$f: (A_1,..,A_n) \rightarrow (B_1,..,B_m)$
where each $A_k$ and $B_k$ is a multidimensional array.

An HBIR program consists of four *segments:*

* **target**: The machine model. This segment describes the set-in-stone silicon resources of the target hardware. An instance of this segment would ship with a particular instance of the HammerBlade hardware.
* **config**: The system configuration. The configuration abstracts the physical hardware resources from `target` as a virtual machine model. Specifically, it organizes the machine into a set of *tile groups* $\\{\mathbb{T_i}\\}_{i \in \mathbb{I}}$ that will be programmed independently.
* **data**: Memory allocations. This segment describes how logical array inputs $A_k$ and outputs $B_k$ are mapped and partitioned across the tile groups $\\mathbb{T_i}$ defined in `config`.
* **code**: The computation itself. This segment lists the SPMD code to execute on each tile group. The code consists of a set of programs, $g_i$, each destined to execute on the corresponding tile group, $\mathbb{T}_i$, defined in `config`. Each $g_i$ can read to the inputs and write to the outputs defined in `data`. Furthermore, the behavior of $g_i$ can optionally vary across $\mathbb{T}_i$'s available computational units, so it may be regarded as either MPMD or SPMD, depending on the particular use case.

In other words, the implementation of $ f $ is given by a set of $g_i:(j, A_1,..,A_n)\rightarrow (B_1,..,B_m)$, where $i \in \mathbb{I}$ specifies a particular tile group, $ \mathbb{T}_i $, and $ j \in \mathbb{T}_i$ specifies a particular computational unit in that tile group. $B_1,..,B_m$ will contain the $f(A_1,..,A_n)$ once execution of all $g_i$ completes.

The following subsections describe the four segments of an HBIR program.
A valid HBIR program must have all four segments with strict
dependencies in terms of the following precedence: target, config, data,
code. The first three segments describe hardware configuration and
memory mappings while the last segment describes high-level algorithmic
details; hence the first three segments have very strict syntax and
little support for extraneous operations.


Target
------

The **target** segment expresses the static parameters of the
HammerBlade hardware that the programmer would like to target. This
corresponds to a specific instance of the HammerBlade architecture which
can support varying amounts of tiles and memory.

    target {
        memory g[4] {
          size 8G;
          width 8B;
        };

        ...
    }

The first part of the segment declares the shared, global memories that
are available to the rest of the hardware. Programmers specify a name,
number of memories, size, and access width for the instance of memory.
The instance of memory can then be referenced in different segments of
an HBIR program. In the code snippet above, we are declaring a 4 global
memories named `g` with a size of 8GB and an access width of 8B. Other
segments of an HBIR program can reference `g` by referencing `target.g` and
can also reference any of the fields of `g` as well. Referencing memory
with discrete, real sizes are often not useful so memory is typically
mapped to logical data structures for use in the code segment (described
later).

    target {
        ...

        tile t[4][4] {
            memory l [4] {
              size 16K;
              width 8B;
            };
        };
    }

The next part of the segment declares the statically defined fabric in
the target architecture. Programmers specify a name and the dimensions
of the tile array. Compute resources (CGRA and CPU) are implicitly
included in a tile declaration but programmer can optionally specify
local memory in the same manner global memories are declared. In the
code snippet above, a tile array of 4x4 is declared, with each tile
having a 16KB local memory with an access width of 8B. Similar to global
memory, programmers can reference this memory by referencing the tile's
name. For example, `target.t.l.size` accesses the size of the local memory
in the tile array. Notice that both memory accesses and tile accesses
don't allow indexing to specific resources. This is because we don't
expect the tiles to be homogeneous as well as the global memory
structures available.

    target {
      memory g[2] {
        size 8G;
        width 8B;
      };

      tile t[128][64] {
        memory l {
          size 64K;
          width 8B;
        };
      };
    }

In the above code snippet, we are modeling the architecture seen in
Figure 1 by mapping the 2 L2 caches as a global memory with each having
8GB and 8B access width. It also has has 128x64 tiles with each one
having a small local memory of 64KB and also an access width of 8B.


Config
------

The **config** segment expresses the dynamic configuration of the
hardware that best fits an application. This section overlays any
physical topology onto the fabric by grouping together resources defined
in the target segment. Currently, groups only reference physical tiles
and do not use any global memories or local memories defined in the
target segment.

    config {
        group tg[4][4] {
            tile target.t[x][y];
        };
    }

To declare a group, users specify a name, the dimensions of the group,
and specifies how the group's tiles index into the physical tiles
defined in the target segment. The most common group involves just
specifying a fixed-size, flat grid of tiles from the tile array.
Following with previous snippets of code, the tile array is named t in
our example and the programmer has specified a flat grid of 16 tiles
from the tile array. The variables x and y are used to represent
indexing into specific tiles in the array which is necessary for more
complicated topologies.

    config {
        group tg[target.t.x_max][target.t.y_max] {
            tile target.t[x][y];
        };
    }

Programmers can also use `x_max` and `y_max` in reference to the tile
array to represent the maximum dimensions of the declared array. This
prevents hard-coding parameters in this segment which translates to
statically allocating groups based off *specific* hardware rather than
having this segment work off of a possibly changing target segment.

    config {
        group tga[2][4] {
            tile target.t[x][y];
        };

        group tgb[2][4] {
            tile target.t[x+2][y];
        };
    }

Programmers are also allowed to declare multiple groups in the segment.
It is important that indexing into the tile array in different groups
inherently prevent any overlapping tiles and any compilers built for
HBIR should check this at compile-time. The code snippet above declares
two groups `tga` and `tgb` which are both rectangles with dimension 2x4 but
maps them next to each other.

    config {
        group grid[target.t.y_max] {
          group row[target.t.x_max] {
            tile target.t[grid.x][row.x];
          };
        };
    }

Finally, programmers can also nest groups.


Data
----

The data segment expresses how logical data structures, currently
vectors and arrays, map to to physical memory, both global and local,
declared in the target segment. The data structures declared here are
used in the code segment to express the high-level
algorithm/application.

    data{
      dim = 500;

      in A : int[3][3] {
        location : config.t,
        layout : block
      }

      out B : int[3][3] {
        location : config.t,
        layout : block
      }
    }

The programmer specifies a name, the type (`int`, `float`, `bool`) and
dimension of the data structure (i.e., vector or array of a certain size),
how it maps to the declared groups, and several flags. Currently, flags
for blocked, replicated, or striped allow programmers to specify how
data should be distributed for the application. A host or device flag
can also be set to specify whether the data is an input (host) or output
(device) of the algorithm. Simple constant declarations are also allowed
as often vectors and arrays use the same dimensions. These have to be
declared before declaring any memory mappings.

The code snippet above declares memory mappings to be used for a
vector-vector add application (i.e., `C = A+B`). In this example, A and B
are declared as integer vectors with a size of 500 that map to the
global memory g defined in the target segment. The host flag is also set
to indicate these two arrays are considered input arrays to the
application and populated by a host. `C` is declared as an integer vector
with a size of 500 that also maps to global memory `g` defined in the
target segment. The device flag is set to indicate that the data
structure is an output and will be populated by the application.

    data{
        ...
        C: int[dim] = block[target.t.x_max][target.t.y_max] {
            target.g[x][y];
            chunked;
            device;
        };
        ...
    }

Other flags are chunked, replicated, and striped which change how the
data is distributed across the data structure.


Code
----

The code segment expresses the high-level application by tying it to
groups defined in the config segment and using logical data structures
defined in the data segment. This operates similar to CUDA where
thread/block/grids determine the different execution contexts as well as
the memory that's used. Programmers specify groups as well as indexing
and the code that should run on tiles in this execution context.
Programmers are also allowed to write simple code that lives outside of
groups (global).

    code{
        int g_done_flag = 0;

        config.tg[0][0]{
            bsg_finish();
        }

        config.tg[x][y]{
            for(int i = x+(y*x_max); i<csize; i=i+1){
                C[i] = A[i] + B[i];
            }
            g_done_flag = 1;
        }
    }

The current implementation of the language exposes some `bsg_manycore`
specific functions and also doesn't infer much from the application at
all but the end-goal is to only have high-level, tile-independent,
C-like code with the only reference to hardware being the group that the
code is tied to. The above code snippet is how a simple vector-vector
add is currently implemented. While only one group is referenced above,
multiple groups can be referenced. In addition, within each group,
single tiles can be referenced to give specific code in addition to
generalized code that runs on all tiles in a group (by using symbolic
indexes). In this example, every tile inside the group `tg` runs a basic
for loop that has data indexing into the global arrays `A`, `B`, and `C`
hard-coded. Tile (0, 0) is also given a special instruction to call
`bsg_finish()`. Global code is also written to declare `g_done_flag`.

### The Mini-Language for Code

The code segment uses a simple, C-like, SSA-form imperative language.
This section outlines basic features that the language currently
supports.

- Generic types: Supports ints, floats, and booleans as generics.

- Basic arithmetic expressions: Supports basic arithmetic and
  boolean operations with precedence rules being equivalent to C.

- Declaration statements: Supports static declarations of arrays and
  basic generic types.

- Control flow statements: Supports basic if-else, while loops, for
  loops, and break statements.

- Miscellaneous features: Supports printing as a primitive.