Getting Started

Calyx is an intermediate language and infrastructure for building compilers that generate custom hardware accelerators. These instructions will help you set up the Calyx compiler and associated tools. By the end, you should be able to compile and simulate hardware designs generated by Calyx.

Compiler Installation

Install Rust (it should automatically install cargo).

Clone the repository:

git clone https://github.com/cucapra/calyx.git

Then build the compiler:

cargo build

You can invoke the compiler in one of two ways:

cargo run -- --help # Rebuilds the compiler if the sources changed
./target/debug/futil --help # Default debug build of the compiler

Running Core Tests

The core test suite tests the Calyx compiler passes. Install the following tools for running the core tests:

  1. runt hosts our testing infrastructure. Install with: cargo install runt
  2. jq is a command-line JSON processor:
    • Ubuntu: sudo apt install jq
    • Mac: brew install jq
    • Other platforms: see the jq installation instructions

Build the compiler:

cargo build

Then run the core tests with:

runt -i core

If everything has been installed correctly, this should not produce any failing tests.

Installing the Command-line Driver

The Calyx driver wraps the various compiler frontends and backends to simplify running Calyx programs.

Install Flit:

pip3 install flit

Install fud (from the root of the repository):

flit -f fud/pyproject.toml install -s

Configure fud:

fud config global.futil_directory <full path to Calyx repository>

Check the fud configuration:

fud check

fud will report certain tools are not available. This is expected.

Simulating with Verilator

Verilator is the default simulation backend for Calyx. If you want to run your Calyx programs, you need to install Verilator.

On a Mac, install with:

brew install verilator

Otherwise, you will probably need to compile it from source yourself (the versions in Linux repositories are generally out of date). These instructions are adapted from Verilator's installation instructions:

git clone https://github.com/verilator/verilator
cd verilator
git pull
git checkout master
autoconf
./configure
make
sudo make install

Use fud to check you have the right version:

fud check

fud should report that the verilator binary was installed and has the right version. Some tools will be reported missing. This is expected.

Generating Simulation Results

By default, simulating with Verilator produces a VCD file, which is not human-readable. Our tests use vcdump to transform VCD files into JSON files.

Install vcdump:

cargo install vcdump

Use fud to check the right version was installed:

fud check

It should report that the vcdump binary is available and has the right version. Some tools will be reported missing. This is expected.

Running a Hardware Design

We're all set to run a Calyx hardware design now. Run the following command:

fud e examples/tutorial/language-tutorial-iterate.futil \
  -s verilog.data examples/tutorial/data.json \
  --to dat -v

This command will compile examples/tutorial/language-tutorial-iterate.futil to Verilog using the Calyx compiler, simulate the design using the data in examples/tutorial/data.json, and generate a JSON representation of the final memory state.

Congratulations! You've simulated your first hardware design with Calyx.

fud: The Calyx Driver

Working with Calyx involves a lot of command-line tools. For example, an incomplete yet daunting list of CLI tools used by Calyx is:

  • All the Calyx frontends.
  • The Calyx compiler and its various command-line tools.
  • Verilator, the Verilog simulation framework used to test Calyx-generated designs.
  • Waveform viewers to see the results of simulation.

fud aims to provide a simple interface for using these toolchains and executing them in a pipeline. The source for fud lives in the fud/ directory of the Calyx repository.

Installation

You need Flit to install fud. Install it with pip3 install flit.

You can then install fud by running the following from the fud/ directory:

flit install

(If using this method to install fud, pip3 should be version >= 20)

If you are working on fud itself, you can install it with a symlink with:

flit install --symlink

You can also install fud with

flit build
pip3 install dist/fud-0.1.0-py3-none-any.whl

Finally, point fud to the root of the repository:

fud config global.futil_directory <full path to Calyx repository>

Configuration

Fud uses a global configuration file to locate tool paths and store default values. To view the configuration, use fud c or fud config.

Check. Fud can automatically check if your configuration is valid and can help you set certain variables. Perform this check with:

fud check

Viewing keys. To view the current value of a key, use fud config key. For example, the following shows the path to the Calyx compiler.

fud config stages.futil.exec

Updating keys. Keys can be updated using fud config key value. For example, the following command updates the path to the Calyx compiler.

fud config stages.futil.exec ./target/debug/futil

Adding Backends

fud wraps both frontends and backends for Calyx. For a minimally useful fud installation, you need to configure the Verilator backend and accompanying tools.

Verilator. We use the open-source Verilator tool to simulate Verilog programs generated by the Calyx compiler. Install Verilator by following the instructions in the Simulating with Verilator section above.

By default, fud will use the verilator executable to run Verilator. To use a different binary, configure the path by:

fud config stages.verilog.exec <binary>

Vcdump. Vcdump is a tool for converting vcd (Value Change Dump) files to JSON for easier analysis with the command line.

Install it with:

cargo install vcdump

Dahlia Frontend

In order to use the Dahlia frontend with Fud, first install Dahlia. Once Dahlia is compiled, point fud to the Dahlia compiler binary:

fud config stages.dahlia.exec <full path to dahlia repo>/fuse

Python frontends (Systolic array, NTT, MrXL, TVM Relay)

You need flit to install our Python frontends.

pip3 install flit

Our Python frontends use a Calyx AST library written in Python. Install it with:

cd calyx-py && flit install -s

Frontend-specific instructions:

  • Systolic array: Nothing else needed.
  • NTT:
    • Install dependencies: pip3 install prettytable
    • Install external fud stage: fud register ntt -p frontends/ntt-pipeline/fud/ntt.py
  • MrXL:
    • Install mrxl binary: cd frontends/mrxl && flit install -s
    • Install mrxl external stage for fud: fud register mrxl -p frontends/mrxl/fud/mrxl.py
  • TVM Relay: See instructions.

Adding Synthesis Backends

fud wraps the Vivado (synth-verilog) and Vivado HLS (vivado-hls) tools to generate area and resource estimates for Calyx designs. See the instructions to configure them.

Working with Stages

Fud is structured as a sequence of stages that transform inputs of one form into outputs of another.

Stages. fud transforms a file in one stage into a file in a later stage. The --from and --to options specify the input and output stages to the fud exec subcommand. Use fud info to view all possible stages.

Guessing stages. fud will try to guess the starting stage by looking at the extension of the input file and output file (if specified using the -o flag). If it fails to guess correctly or doesn't know about the extension, you can manually set the stages using --to and --from.

Examples

These commands will assume you're in the root directory of the Calyx repository.

Compiling Calyx. Fud wraps the Calyx compiler and provides a set of default compiler options to compile Calyx programs to Verilog.

# Compile Calyx source in the test simple.expect
# to Verilog. We must explicitly specify the input
# file type because it can not be guessed from
# the file extension.
fud exec examples/futil/simple.expect --from futil --to verilog

Fud can explain its execution plan when running a complex sequence of steps using the --dry-run option.

# Dry run of compiling the Dahlia dot product file
# to Calyx. As expected, this will *only* print
# the stages that will be run.
fud exec examples/dahlia/dot-product.fuse --to futil --dry-run

Simulating Calyx. Fud can compile a Calyx program to Verilog and simulate it using Verilator.

# Compile and simulate a vectorized add implementation
# in Calyx using the data provided,
# then dump the vcd into a new file for debugging.
# === Calyx:   examples/futil/vectorized-add.futil
# === data:    examples/dahlia/vectorized-add.fuse.data
# === output:  v-add.vcd
fud exec \
  examples/futil/vectorized-add.futil \
  -o v-add.vcd \
  -s verilog.data examples/dahlia/vectorized-add.fuse.data

Simulating Dahlia. The following command prints out the final state of all memories by specifying --to dat.

# Compile a Dahlia dot product implementation and
# simulate in verilog using the data provided.
# === Dahlia: examples/dahlia/dot-product.fuse
# === data:   examples/dahlia/dot-product.fuse.data
#     (`.data` is used as an extension alias for `.json`)
fud exec \
  examples/dahlia/dot-product.fuse \
  --to dat \
  -s verilog.data examples/dahlia/dot-product.fuse.data

Interpreting Calyx. In addition to Verilator, fud can execute Calyx programs using the experimental interpreter.

# Execute a Calyx program without compiling it,
# producing a JSON snapshot of the final state.
# === Calyx:   tests/correctness/while.futil
# === data:    tests/correctness/while.futil.data
fud exec \
  tests/correctness/while.futil \
  -s interpreter.data tests/correctness/while.futil.data \
  --to interpreter-out

Synthesis and HLS Backend

Working with synthesis backends is painful and requires advanced debugging. For most things, we recommend not installing these tools yourself. If you're a Cornell student, you can use our lab server, Gorgonzola.

fud uses Vivado to generate area and resource estimates for designs generated by Calyx. It also supports compiling Dahlia programs through the HLS-backend using Vivado HLS.

There are two ways to get fud working with Vivado.

Vivado and Vivado HLS over SSH

fud supports invoking these tools over SSH. You have to tell fud the username and hostname for a server that has these tools installed:

# vivado
fud config stages.synth-verilog.ssh_host <hostname>
fud config stages.synth-verilog.ssh_username <username>

# vivado hls
fud config stages.vivado-hls.ssh_host <hostname>
fud config stages.vivado-hls.ssh_username <username>

Note: vivado or vivado_hls have to be on the path of the remote machine for this to work. If you need the names to be something else, file an issue. fud currently does not support other names.

Vivado/VivadoHLS locally

We don't provide installation instructions for this. However, fud will look for vivado and vivado-hls binaries on the system. If these are installed, you can use fud to invoke these tools. You can change the paths fud looks for with

fud config stages.synth-verilog.exec <path> # update vivado path
fud config stages.vivado-hls.exec <path> # update vivado_hls path

External Stages

fud supports using stages that aren't defined in its main source tree. These are known as 'external stages' and they provide a mechanism for projects using Calyx to take advantage of fud. You can register an external stage with:

fud register stage_name -p /path/to/stage.py

Once an external stage is registered, it behaves exactly like any other stage.

You can remove an external stage with:

fud register stage_name --delete

The following defines a stage that transforms MrXL programs to Calyx programs.

from fud.stages import Stage, SourceType
from fud.utils import shell


class MrXLStage(Stage):
    """
    Stage that invokes the MrXL frontend.
    """

    def __init__(self, config):
        super().__init__(
            name="mrxl",
            target_stage="futil",
            input_type=SourceType.Path,
            output_type=SourceType.Stream,
            config=config,
            description="Compiles MrXL to Calyx.",
        )
        self.setup()

    @staticmethod
    def defaults():
        return {
            "exec": "mrxl"
        }

    def _define_steps(self, input_path):
        @self.step(description=self.cmd)
        def run_mrxl(mrxl_prog: SourceType.Path) -> SourceType.Stream:
            return shell(f"{self.cmd} {str(mrxl_prog)}")

        return run_mrxl(input_path)


# Export the defined stages to fud
__STAGES__ = [
    MrXLStage
]

External stages must define default values for configuration keys using the Stage.defaults() static method.

Stage Configuration

Like normal stages, external stages can have persistent configuration information saved using fud config.

To add persistent stage configuration, run:

fud config stages.<stage-name>.<key> <value>

To dynamically override the value of a field during execution, use the -s flag:

fud e -s <stage-name>.<key> <value> ...

The override order for stage configuration is:

  1. Dynamic values provided by -s.
  2. Configuration value in the fud config.
  3. Default value provided by Stage.defaults()
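
For example, with the mrxl stage from the example above registered, the following run temporarily overrides its exec key (the input file name is illustrative):

fud e my-program.mrxl --to futil -s mrxl.exec ./my-mrxl-binary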

The Calyx Compiler

The source code documentation for the compiler can be found here.

The Calyx compiler has several command-line options to control the execution of various passes and backends.

Specifying Primitives Library

The compiler implementation uses a standard library of components to compile programs.

The standard library for the compiler is located in:

<path to Calyx repository>/primitives

Specify the location of the library using the -l flag:

cargo run -- -l ./primitives

Primitive Libraries Format

The primitive libraries consist of a .futil file paired with a .sv file. The .futil file defines a series of Calyx shim bindings in extern blocks which match up with SystemVerilog definitions of those primitives. These libraries may also expose components written in Calyx, usually defined using primitives exposed by the file.

No Calyx program can work without the primitives defined in the Core Library.

Controlling Passes

The compiler is organized as a sequence of passes that are run when the compiler executes.

To get a complete list of all passes, run the following from the repository root:

cargo run -- --list-passes

This generates results of the form:

Passes:
- collapse-control
- compile-control
...

Aliases:
- all: well-formed, papercut, remove-external-memories, ...
...

The first section lists all the passes implemented in the compiler. The second section lists aliases for combinations of passes that are commonly run together. For example, the alias all is an ordered sequence of default passes executed when the compiler is run from the command-line.

The command-line provides two options to control the execution of passes:

  • -p, --pass: Execute this pass or alias. Overrides default alias.
  • -d, --disable-pass: Disable this pass or alias. Takes priority over -p.

For example, we can run the following to disable the static-timing pass from the default execution alias all:

cargo run -- examples/futil/simple.futil -p all -d static-timing

Core Library

This library defines a standard set of components used in most Calyx programs such as registers and basic bitwise operations.


Numerical Operators

std_reg<WIDTH>

A WIDTH-wide register.

Inputs:

  • in: WIDTH - An input value to the register, WIDTH bits wide.
  • write_en: 1 - The one-bit write enable signal. Indicates that the register should store the value on the in wire.

Outputs:

  • out: WIDTH - The value contained in the register.
  • done: 1 - The register's done signal. Set high for one cycle after writing a new value.
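
As a quick usage sketch (not part of the library definition), assuming a cell r = std_reg(32), a group can write the register and wait on its done signal:

group write_r {
  r.in = 32'd42;
  r.write_en = 1'b1;
  write_r[done] = r.done;
}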

std_const<WIDTH,VAL>

A constant WIDTH-bit value with value VAL.

Inputs: None

Outputs:

  • out: WIDTH - The value of the constant (i.e. VAL)

std_lsh<WIDTH>

A left bit shift. Performs LEFT << RIGHT. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit value to be shifted
  • right: WIDTH - A WIDTH-bit value representing the shift amount

Outputs:

  • out: WIDTH - A WIDTH-bit value equivalent to LEFT << RIGHT

std_rsh<WIDTH>

A right bit shift. Performs LEFT >> RIGHT. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit value to be shifted
  • right: WIDTH - A WIDTH-bit value representing the shift amount

Outputs:

  • out: WIDTH - A WIDTH-bit value equivalent to LEFT >> RIGHT

std_add<WIDTH>

Bitwise addition without a carry flag. Performs LEFT + RIGHT. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit value
  • right: WIDTH - A WIDTH-bit value

Outputs:

  • out: WIDTH - A WIDTH-bit value equivalent to LEFT + RIGHT
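
As a usage sketch, assuming cells add = std_add(32) and r = std_reg(32), the adder can be driven and its output consumed in the same group because it is combinational:

group incr_r {
  add.left = r.out;
  add.right = 32'd1;
  r.in = add.out;
  r.write_en = 1'b1;
  incr_r[done] = r.done;
}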

std_sub<WIDTH>

Bitwise subtraction. Performs LEFT - RIGHT. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit value
  • right: WIDTH - A WIDTH-bit value

Outputs:

  • out: WIDTH - A WIDTH-bit value equivalent to LEFT - RIGHT

std_slice<IN_WIDTH, OUT_WIDTH>

Slice out the lower OUT_WIDTH bits of an IN_WIDTH-bit value. Computes in[OUT_WIDTH - 1 : 0]. This component is combinational.

Inputs:

  • in: IN_WIDTH - An IN_WIDTH-bit value

Outputs:

  • out: OUT_WIDTH - The lower OUT_WIDTH bits of in

std_pad<IN_WIDTH, OUT_WIDTH>

Given an IN_WIDTH-bit input, zero-pad it on the left to produce an OUT_WIDTH-bit output. This component is combinational.

Inputs:

  • in: IN_WIDTH - An IN_WIDTH-bit value to be padded

Outputs:

  • out: OUT_WIDTH - The padded value

Logical Operators

std_not<WIDTH>

Bitwise NOT. This component is combinational.

Inputs:

  • in: WIDTH - A WIDTH-bit input.

Outputs:

  • out: WIDTH - The bitwise NOT of the input (~in)

std_and<WIDTH>

Bitwise AND. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: WIDTH - The bitwise AND of the arguments (left & right)

std_or<WIDTH>

Bitwise OR. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: WIDTH - The bitwise OR of the arguments (left | right)

std_xor<WIDTH>

Bitwise XOR. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: WIDTH - The bitwise XOR of the arguments (left ^ right)

Comparison Operators

std_gt<WIDTH>

Greater than. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: 1 - A single bit output. 1 if left > right else 0.

std_lt<WIDTH>

Less than. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: 1 - A single bit output. 1 if left < right else 0.

std_eq<WIDTH>

Equality comparison. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: 1 - A single bit output. 1 if left = right else 0.

std_neq<WIDTH>

Not equal. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: 1 - A single bit output. 1 if left != right else 0.

std_ge<WIDTH>

Greater than or equal. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: 1 - A single bit output. 1 if left >= right else 0.

std_le<WIDTH>

Less than or equal. This component is combinational.

Inputs:

  • left: WIDTH - A WIDTH-bit argument
  • right: WIDTH - A WIDTH-bit argument

Outputs:

  • out: 1 - A single bit output. 1 if left <= right else 0.

Memories

std_mem_d1

A one-dimensional memory.

Parameters:

  • WIDTH - Size of an individual memory slot.
  • SIZE - Number of slots in the memory.
  • IDX_SIZE - The width of the index given to the memory.

Inputs:

  • addr0: IDX_SIZE - The index to be accessed or updated
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - The one-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0

Outputs:

  • read_data: WIDTH - The value stored at addr0. This value is combinational with respect to addr0.
  • done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.
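
As a usage sketch, assuming cells mem = std_mem_d1(32, 1, 1) and r = std_reg(32), a group can read the value at address zero into the register; the read itself needs no handshake because read_data is combinational:

group read_mem {
  mem.addr0 = 1'd0;
  r.in = mem.read_data;
  r.write_en = 1'b1;
  read_mem[done] = r.done;
}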

std_mem_d2

A two-dimensional memory.

Parameters:

  • WIDTH - Size of an individual memory slot.
  • D0_SIZE - Number of memory slots for the first index.
  • D1_SIZE - Number of memory slots for the second index.
  • D0_IDX_SIZE - The width of the first index.
  • D1_IDX_SIZE - The width of the second index.

Inputs:

  • addr0: D0_IDX_SIZE - The first index into the memory
  • addr1: D1_IDX_SIZE - The second index into the memory
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - The one-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0 and addr1

Outputs:

  • read_data: WIDTH - The value stored at mem[addr0][addr1]. This value is combinational with respect to addr0 and addr1.
  • done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.

std_mem_d3

A three-dimensional memory.

Parameters:

  • WIDTH - Size of an individual memory slot.
  • D0_SIZE - Number of memory slots for the first index.
  • D1_SIZE - Number of memory slots for the second index.
  • D2_SIZE - Number of memory slots for the third index.
  • D0_IDX_SIZE - The width of the first index.
  • D1_IDX_SIZE - The width of the second index.
  • D2_IDX_SIZE - The width of the third index.

Inputs:

  • addr0: D0_IDX_SIZE - The first index into the memory
  • addr1: D1_IDX_SIZE - The second index into the memory
  • addr2: D2_IDX_SIZE - The third index into the memory
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - The one-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0, addr1, and addr2

Outputs:

  • read_data: WIDTH - The value stored at mem[addr0][addr1][addr2]. This value is combinational with respect to addr0, addr1, and addr2.
  • done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.

std_mem_d4

A four-dimensional memory.

Parameters:

  • WIDTH - Size of an individual memory slot.
  • D0_SIZE - Number of memory slots for the first index.
  • D1_SIZE - Number of memory slots for the second index.
  • D2_SIZE - Number of memory slots for the third index.
  • D3_SIZE - Number of memory slots for the fourth index.
  • D0_IDX_SIZE - The width of the first index.
  • D1_IDX_SIZE - The width of the second index.
  • D2_IDX_SIZE - The width of the third index.
  • D3_IDX_SIZE - The width of the fourth index.

Inputs:

  • addr0: D0_IDX_SIZE - The first index into the memory
  • addr1: D1_IDX_SIZE - The second index into the memory
  • addr2: D2_IDX_SIZE - The third index into the memory
  • addr3: D3_IDX_SIZE - The fourth index into the memory
  • write_data: WIDTH - Data to be written to the selected memory slot
  • write_en: 1 - The one-bit write enable signal; causes the memory to write write_data to the slot indexed by addr0, addr1, addr2, and addr3

Outputs:

  • read_data: WIDTH - The value stored at mem[addr0][addr1][addr2][addr3]. This value is combinational with respect to addr0, addr1, addr2, and addr3.
  • done: 1 - The done signal for the memory. This signal goes high for one cycle after finishing a write to the memory.

The Calyx Interpreter

The experimental Calyx interpreter resides in the interp/ directory of the repository. The interpreter supports all Calyx programs, from high-level programs that make heavy use of control operators to fully lowered Calyx programs. (RTL simulation, in contrast, only supports execution of fully lowered programs.)

There are two ways to use the interpreter: you can directly invoke it, or you can use fud.

Basic Use

To run an example program, try:

cd interp && cargo run tests/control/if.futil

You can see the available command-line options by typing cargo run -- --help.

Interpreting via fud

The interpreter is available as a stage in fud, which lets you provide standard JSON data files as input and easily execute passes on the input Calyx program before interpretation.

You'll want to build the interpreter first:

cd interp && cargo build

Here's how to run a Calyx program:

fud e --to interpreter-out interp/tests/control/if.futil

To provide input data, set the interpreter.data variable, like so:

fud e --to interpreter-out \
    -s interpreter.data tests/correctness/while.futil.data \
    tests/correctness/while.futil

By default, fud will not transform the Calyx code before feeding it to the interpreter. To run passes before the interpreter, use the futil.flags variable in conjunction with the -p flag. For example, to fully lower the Calyx program before interpreting it:

fud e --to interpreter-out \
    -s futil.flags '-p all' \
    interp/tests/control/if.futil

Tools

Calyx uses an array of tools to simplify and streamline compiler testing, development, and usage.

Runt

Runt (Run Tests) is the expectation testing framework for Calyx. It organizes collections of tests into test suites and specifies configuration for them.

Runt uses runt.toml to define the test suites and configure them.

For instructions on using runt, see the official documentation.

exp Generator

The exp generator uses a Taylor series approximation to calculate the fixed-point value of the natural exponential function e^x. The Maclaurin series for the function can be written as:

e^x = 1 + x + x^2/2! + x^3/3! + ... + x^n/n!

where n is the degree (or order) of the polynomial. For example, the degree-2 approximation of e^1 is 1 + 1 + 1/2 = 2.5.

For negative values of x, we can take the reciprocal:

e^(-x) = 1/e^x

The gen_exp.py file can generate an entire Calyx program for testing purposes. The main component contains memories x (for the input) and ret (for the result of e^x). To generate an example program with degree 4, bit width 32, integer bit width 16, and x interpreted as a signed value:

./calyx-py/calyx/gen_exp.py -d 4 -w 32 -i 16 -s true

Similarly, it provides a function to produce only the necessary components to be dropped into other Calyx programs.

Installation

Install the calyx-py library.

Command Line Options

The command-line options configure the degree (or order) of the Taylor series, bit width, integer bit width, and sign.

  • --degree: The degree of the Taylor polynomial.
  • --width: The bit width of the value x.
  • --int_width: The integer bit width of the value x. The fractional bit width is then inferred as width - int_width.
  • --is_signed: The signed interpretation of the value x. If x is a signed value, this should be true; otherwise, false.

Editor Highlighting

Vim

The Vim extension highlights files with the extension .futil. It can be installed with a plugin manager such as vim-plug using a local installation. Add the following to your vim-plug configuration:

Plug '<path-to-calyx>/tools/vim'

And run:

:PlugInstall

Emacs

futil-mode implements highlighting for .futil files in Emacs. It is located in <repo>/tools/emacs/futil-mode.

Clone the repository, add the above path to your load path, and require futil-mode. If you use Spacemacs, this looks like adding the following lines to dotspacemacs/user-config in your .spacemacs file:

(push "~/.emacs.d/private/local/fuse-mode" load-path)
(require 'fuse-mode)

I imagine it looks very similar for pure emacs, but haven't actually tried it myself.

Visual Studio Code

Add a link to the Calyx VSCode extension directory to your VSCode extensions directory.

cd $HOME/.vscode/extensions
ln -s <calyx root directory>/tools/vscode calyx.calyx-0.0.1

Restart VSCode.

Calyx Language Tutorial

This tutorial will familiarize you with the Calyx language by writing a minimal program by hand. Usually, Calyx code will be generated by a frontend. However, by writing the program by hand, you can get familiar with all the basic constructs you need to generate Calyx yourself!

Complete code for each example can be found in the tutorial directory in the Calyx repository.

Get Started

The basic building block of Calyx programs is a component, which corresponds to a hardware module (or a function definition, for the software-minded).

Here is an empty component definition for main along with an import statement to import the standard library:

import "primitives/std.lib";

component main(go: 1) -> (done: 1) {
  cells {
  }

  wires {
  }

  control {
  }
}

Put this in a file—you can call it language-tutorial-mem.futil, for example. (The futil file extension comes from an old name for Calyx.)

You can think of a component as a unit of Calyx code roughly analogous to a function: it encapsulates a logical unit of hardware structures along with their control. Every component definition has three sections:

  • cells: The hardware subcomponents that make up this component.
  • wires: A set of guarded connections between components, possibly organized into groups.
  • control: The imperative program that defines the component's execution schedule: i.e., when each group executes.

We'll fill these sections up minimally in the next sections.

A Memory Cell

Let's turn our skeleton into a tiny, nearly no-op Calyx program. We'll start by adding a memory component to the cells:

  cells {
    @external(1) mem = std_mem_d1(32, 1, 1);
  }

This line declares a new cell called mem, an instance of the primitive component std_mem_d1, which represents a 1D memory. You can see the definition of std_mem_d1, and all the other standard components, in the primitives/std.lib library we imported.

This one has three parameters: the data width (here, 32 bits), the number of elements (just one), and the width of the address port (one bit).

The @external(1) syntax is an extra bit of magic that allows us to provide initial data for the memory and to read its contents back out after execution.

Next, we'll add some assignments to the wires section to update the value in the memory. Insert these lines to put a constant value into the memory:

  wires {
      mem.addr0 = 1'b0;
      mem.write_data = 32'd42;
      mem.write_en = 1'b1;
      done = mem.done;
  }

These assignments refer to four ports on the memory component: addr0 (the address port), write_data (the value we're putting into the memory), write_en (the write enable signal, telling the memory that it's time to do a write), and done (signals that the write was committed). Constants like 32'd42 are Verilog-like literals that include the bit width (32), the base (d for decimal), and the value (42).

Assignments at the top level in the wires section, like these, are "continuous". They always happen, without any need for control statements to orchestrate them. We'll see later how to organize assignments into groups.

The complete program for this section is available under examples/tutorial/language-tutorial-mem.futil.

Compile & Run

We can almost run this program! But first, we need to provide it with data. The Calyx infrastructure can provide data to programs from JSON files. So make a file called something like data.json containing something along these lines:

{
  "mem": {
    "data": [10],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
    }
  }
}

The mem key means we're providing the initial value for our memory called mem. We have one (unsigned integer) data element, and we indicate the bit width (32 bits).

If you want to see how this Calyx program compiles to Verilog, here's the fud incantation you need:

fud exec language-tutorial-mem.futil --to verilog

Not terribly interesting! However, one nice thing you can do with programs is execute them.

To run our program using Verilator, do this:

fud exec language-tutorial-mem.futil --to dat -s verilog.data data.json

Using --to dat asks fud to run the program, and the extra -s verilog.data <filename> argument tells it where to find the input data. Executing this program should print:

{
  "cycles": 0,
  "memories": {
    "mem": [
      42
    ]
  }
}

This means that, after the program finished, the final value in our memory was 42.

Add Control

Let's change our program to use an execution schedule. First, we're going to wrap all the assignments in the wires section into a named group:

  wires {
    group the_answer {
      mem.addr0 = 1'b0;
      mem.write_data = 32'd42;
      mem.write_en = 1'b1;
      the_answer[done] = mem.done;
    }
  }

We also need one extra line in the group: that assignment to the_answer[done]. Here, we say that the_answer's work is done once the update to mem has finished. Calyx groups have compilation holes called go and done that the control program will use to orchestrate their execution.

The last thing we need is a control program. Add one line to activate the_answer and then finish:

  control {
    the_answer;
  }

If you execute this program, it should do the same thing as the original group-free version: mem ends up with 42 in it. But now we're controlling things with an execution schedule.

If you're curious to see how the Calyx compiler lowers this program to a Verilog-like structural form of Calyx, you can do this:

fud exec language-tutorial-mem.futil --to futil-lowered

Notably, you'll see control {} in the output, meaning that the compiler has eliminated all the control statements and replaced them with continuous assignments in wires.

The complete program for this section is available under examples/tutorial/language-tutorial-control.futil.

Add an Adder

The next step is to actually do some computation. In this version of the program, we'll read a value from the memory, increment it, and store the updated value back to the memory.

First, we will add two components to the cells section:

  cells {
    @external(1) mem = std_mem_d1(32, 1, 1);
    val = std_reg(32);
    add = std_add(32);
  }

We make a register val and an integer adder add, both configured to work on 32-bit values.

Next, we'll create three groups in the wires section for the three steps we want to run: read, increment, and write back to the memory. Let's start with the last step, which looks pretty similar to our the_answer group from above, except that the value comes from the val register instead of a constant:

    group write {
      mem.addr0 = 1'b0;
      mem.write_en = 1'b1;
      mem.write_data = val.out;
      write[done] = mem.done;
    }

Next, let's create a group read that moves the value from the memory to our register val:

    group read {
      mem.addr0 = 1'b0;
      val.in = mem.read_data;
      val.write_en = 1'b1;
      read[done] = val.done;
    }

Here, we use the memory's read_data port to get the initial value out.

Finally, we need a third group to add and update the value in the register:

    group upd {
      add.left = val.out;
      add.right = 32'd4;
      val.in = add.out;
      val.write_en = 1'b1;
      upd[done] = val.done;
    }

The std_add component from the standard library has two input ports, left and right, and a single output port, out, which we hook up to the register's in port. This group adds a constant 4 to the register's value, updating it in place. We can enable the val register with a constant 1 because the std_add component is combinational, meaning its results are ready "instantly" without the need to wait for a done signal.

We need to extend our control program to orchestrate the execution of the three groups. We will need a seq statement to say we want to run the three steps sequentially:

  control {
    seq {
      read;
      upd;
      write;
    }
  }

Try running this program again. The memory's initial value was 10, and its final value after execution should be 14.

The complete program for this section is available under examples/tutorial/language-tutorial-compute.futil.

Iterate

Next, we'd like to run our little computation in a loop. The idea is to use Calyx's while control construct, which works like this:

while <value> with <group> {
  <body>
}

A while loop runs the control statements in the body until <value>, which is some port on some component, becomes zero. The with <group> bit means that we activate a given group in order to compute the condition value that determines whether the loop continues executing.

Let's run our memory-updating seq block in a while loop. Change the control program to look like this:

  control {
    seq {
      init;
      while lt.out with cond {
        par {
          seq {
            read;
            upd;
            write;
          }
          incr;
        }
      }
    }
  }

This version uses while, the parallel composition construct par, and a few new groups we will need to define. The idea is that we'll use a counter register to make this loop run a fixed number of times, like a for loop. First, we have an outer seq that invokes an init group that we will write to set the counter to zero. The while loop then uses a new group cond, and it will run while a signal lt.out remains nonzero: this signal will compute counter < 8. The body of the loop runs our old seq block in parallel with a new incr group to increment the counter.

Let's add some cells to our component:

    counter = std_reg(32);
    add2 = std_add(32);
    lt = std_lt(32);

We'll need a new register, an adder to do the incrementing, and a less-than comparator.

We can use these raw materials to build the new groups we need: init, incr, and cond. First, the init group is pretty simple:

    group init {
      counter.in = 32'd0;
      counter.write_en = 1'b1;
      init[done] = counter.done;
    }

This group just writes a zero into the counter and signals that it's done. Next, the incr group adds one to the value in counter using add2:

    group incr {
      add2.left = counter.out;
      add2.right = 32'd1;
      counter.in = add2.out;
      counter.write_en = 1'b1;
      incr[done] = counter.done;
    }

And finally, cond uses our comparator lt to compute the signal we need for our while loop. We use a comb group to denote that the assignments inside the condition can be run combinationally:

    comb group cond {
      lt.left = counter.out;
      lt.right = 32'd8;
    }

By comparing with 8, we should now be running our loop body 8 times.

Try running this program again. The output should be the result of adding 4 to the initial value 8 times, so 10 + 8 × 4.

The complete program for this section is available under examples/tutorial/language-tutorial-iterate.futil.

Multi-Component Designs

Calyx designs can define and instantiate other Calyx components that themselves encode complex control programs.

As an example, we'll build a Calyx design that uses a simple Calyx component to save a value in a register and use it in a different component.

We define a new component called identity that has an input port in and an output port out.

component identity(in: 32) -> (out: 32) {
  cells {
    r = std_reg(32);
  }
  wires {
    group save {
      r.in = in;
      r.write_en = 1'd1;
      save[done] = r.done;
    }

    // This component always outputs the current value in r
    out = r.out;
  }
  control {
    save;
  }
}

The following line defines a continuous assignment, i.e., an assignment that is always kept active, regardless of whether the component's control program is active.

    // This component always outputs the current value in r
    out = r.out;

By defining this continuous assignment, we can execute our component and later observe any relevant values.

Next, we can instantiate this component in any other Calyx component. The following Calyx program instantiates identity (as the cell id) and uses it to save a value and observe it.

component main() -> () {
  cells {
    // Instantiate the identity element
    id = identity();
    current_value = std_reg(32);
  }
  wires {
    group run_id {
      // We want to "save" the value 10 inside the identity group.
      id.in = 32'd10;
      // All components have a magic "go" and "done" port added to them.
      // Execute the component.
      id.go = 1'd1;
      run_id[done] = id.done;
    }
    group use_id {
      // We want to "observe" the current value saved in id.
      // The out port on the `id` component always shows the last saved
      // element. We don't need to set the `go` because we're not executing
      // any control.
      current_value.in = id.out;
      current_value.write_en = 1'd1;
      use_id[done] = current_value.done;
    }
  }
  control {
    seq { run_id; use_id; }
  }
}

Our first group executes the component by setting its go signal high and placing the value 10 on the input port. The second group simply saves the value on the output port. Importantly, we don't have to set the go signal high here because we don't need to save a new value into the component. The control program executes the two groups in order.

To see the output from running this component, run the command:

fud e examples/futil/multi-component.futil --to vcd_json

Passing Memories by Reference

One question that may arise when using Calyx as a backend is how to pass a memory "by reference" between components. In C++, this might look like:

#include <array>
#include <cstdint>

// Adds one to the first element in `v`.
void add_one(std::array<uint32_t, 1>& v) {
  v[0] = v[0] + 1;
}

int main() {
  std::array<uint32_t, 1> x = { 0 };
  add_one(x); // The value at x[0] is now 1.
}

In the code above, we've constructed an "l-value reference" to the array, which essentially means we can both read and write from x in the function add_one.

Now, let's allow similar functionality at the Calyx IR level. We define a new component named add_one which represents the function above. However, we also need to include the correct ports to both read and write to x:

  • Read from x: read_data, address ports
  • Write to x: done, write_data, write_en, address ports

Since we're both reading and writing from x, we'll include the union of the two lists above:

component add_one(x_done: 1, x_read_data: 32) ->
                 (x_write_data: 32, x_write_en: 1, x_addr0: 1) {

One tricky thing to note is where the ports belong: should a given port be an input or an output of the component? The way to reason about this is to ask whether we want to receive a signal from or send a signal to the given wire. For example, with read_data, we will always be receiving its signal, so it should be an input port. Conversely, address ports are used to mark where in memory we want to access, so those are output ports.

We then simply use the given ports to both read and write to the memory passed by reference. Note that we've split up the read and write to memory x into separate groups, to ensure we can schedule them sequentially in the execution flow. We're also using the ports exposed through the component interface rather than, say, x.write_data.

    group read_from_x {
      x_addr0 = 1'd0;            // Set address port to zero.
      tmp_reg.in = x_read_data;  // Read the value at address zero.
      tmp_reg.write_en = 1'd1;
      read_from_x[done] = tmp_reg.done;
    }
    group write_to_x {
      x_addr0 = 1'd0;            // Set address port to zero.
      add.left = one.out;
      add.right = tmp_reg.out;   // Saved value from previous read.

      x_write_data = add.out;    // Write value to address zero.
      x_write_en = 1'd1;         // Set write enable signal to high.

      write_to_x[done] = x_done; // The group is done when the write is complete.
    }

Bringing everything back together, the add_one component is written accordingly:

component add_one(x_done: 1, x_read_data: 32) ->
                 (x_write_data: 32, x_write_en: 1, x_addr0: 1) {
  cells {
    one = std_const(32, 1);
    add = std_add(32);
    tmp_reg = std_reg(32);
  }
  wires {
    group read_from_x {
      x_addr0 = 1'd0;            // Set address port to zero.
      tmp_reg.in = x_read_data;  // Read the value at address zero.
      tmp_reg.write_en = 1'd1;
      read_from_x[done] = tmp_reg.done;
    }
    group write_to_x {
      x_addr0 = 1'd0;            // Set address port to zero.
      add.left = one.out;
      add.right = tmp_reg.out;   // Saved value from previous read.

      x_write_data = add.out;    // Write value to address zero.
      x_write_en = 1'd1;         // Set write enable signal to high.

      write_to_x[done] = x_done; // The group is done when the write is complete.
    }
  }
  control {
    seq { read_from_x; write_to_x; }
  }
}

The final step is creating a main component from which the original component will be invoked. In this step, it is important to hook up the proper wires in the call to invoke to the corresponding memory you'd like to read and/or write to:

  control {
    invoke add_one0(x_done = x.done, x_read_data = x.read_data)
                   (x_write_data = x.write_data, x_write_en = x.write_en, x_addr0 = x.addr0);
  }

This gives us the main component:

component main() -> () {
  cells {
    add_one0 = add_one();
    @external(1) x = std_mem_d1(32, 1, 1);
  }
  wires {
  }
  control {
    invoke add_one0(x_done = x.done, x_read_data = x.read_data)
                   (x_write_data = x.write_data, x_write_en = x.write_en, x_addr0 = x.addr0);
  }
}

To see this example simulated, run the command:

fud e examples/futil/memory-by-reference/memory-by-reference.futil --to dat \
-s verilog.data examples/futil/memory-by-reference/memory-by-reference.futil.data

Multi-dimensional Memories

Not much changes for multi-dimensional arrays. The only additional step is adding the corresponding address ports. For example, a 2-dimensional memory will require address ports addr0 and addr1. More generally, an N-dimensional memory will require address ports addr0, ..., addr(N-1).
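
For example, a version of add_one that takes a 2-dimensional memory by reference might have a signature along these lines (a sketch; the address port widths are illustrative):

component add_one(x_done: 1, x_read_data: 32) ->
                 (x_write_data: 32, x_write_en: 1, x_addr0: 1, x_addr1: 1) {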

Multiple Memories

Similarly, multiple memories just require the ports to be passed for each of the given memories. Here is an example of a memory copy (known as memcpy in the C standard library), with 1-dimensional memories of size 5:

import "primitives/std.lib";
component copy(dest_done: 1, src_read_data: 32, length: 3) ->
              (dest_write_data: 32, dest_write_en: 1, dest_addr0: 3, src_addr0: 3) {
  cells {
    lt = std_lt(3);
    N = std_reg(3);
    add = std_add(3);
  }
  wires {
    comb group cond {
      lt.left = N.out;
      lt.right = length;
    }
    group upd_index<"static"=1> {
      add.left = N.out;
      add.right = 3'd1;
      N.in = add.out;
      N.write_en = 1'd1;
      upd_index[done] = N.done;
    }
    group copy_index_N<"static"=1> {
      src_addr0 = N.out;
      dest_addr0 = N.out;
      dest_write_en = 1'd1;
      dest_write_data = src_read_data;
      copy_index_N[done] = dest_done;
    }
  }
  control {
    while lt.out with cond {
      seq {
        copy_index_N;
        upd_index;
      }
    }
  }
}

component main() -> () {
  cells {
    @external(1) d = std_mem_d1(32,5,3);
    @external(1) s = std_mem_d1(32,5,3);
    length = std_const(3, 5);
    copy0 = copy();
  }
  wires {
  }
  control {
    seq {
      invoke copy0(dest_done=d.done, src_read_data=s.read_data, length=length.out)
                  (dest_write_data=d.write_data, dest_write_en=d.write_en, dest_addr0=d.addr0, src_addr0=s.addr0);
    }
  }
}

Attributes

Calyx has an attribute system that allows information to be associated with every basic Calyx construct. This information can then be used to optimize the program or change how the program is compiled.

Attributes can decorate lots of things in Calyx: components, groups, cells, ports, and control statements. The syntax looks like name<"attr"=value> for components and groups or @attr(value) for other constructs. Attributes always map keys to values. Because it's common to have a "Boolean" attribute that always maps to the value 1, the syntax @attr is a shorthand for @attr(1).

Here is the syntax for attributes in different parts of the AST:

Component and Port Attributes

component main<"static"=10>(@go go: 1) -> (@done done: 1) {
 ...
}

Cell Attributes

cells {
  @external mem = std_mem_d1(32, 8, 4);
  reg = std_reg(32);
  ...
}

Group Attributes

group cond<"static"=1> {
  ...
}

Control Attributes

control {
  @static(3) seq {
    @static(1) A;
    @static(2) B;
  }
}

Meaning of Attributes

external

The external attribute has meaning when it is attached to a cell. It has two effects:

  1. If the externalize pass is enabled, the cell is turned into an "external" cell by exposing all its ports through the current component and rewriting assignments to the use the ports. See the documentation on See externalize for more information.
  2. If the cell is a memory and has an external attribute on it, the verilog backend (-b verilog) generates code to read <cell_name>.dat to initialize the memory state and dumps out its final value after execution.

static(n)

Can be attached to components, groups, and control statements. They indicate how many cycles a component, group, or control statement will take to run and are used by -p static-timing to generate more efficient control FSMs.

go, done, and reset

These three ports are part of the interface to Calyx components. They are the mechanism for how an "outer" component invokes an "inner" cell that it contains.

The go and done attributes are, in particular, used by the infer-static-timing pass to configure which ports are used like go and done signals. Along with the static(n) attribute, this allows the pass to calculate when a particular done signal of a primitive will be high.

share

Can be attached to a component and indicates that a component can be shared across groups. This is used by the -p resource-sharing to decide which components can be shared.

bound(n)

Used in infer-static-timing and static-timing when the number of iterations of a While control is known statically, as indicated by n.

generated

Added by ir::Builder to denote that the cell was added by a pass.

clk

Marks the special clock signal inserted by the clk-insertion pass, which helps with lowering to RTL languages that require an explicit clock.

write_together(n)

Used by the papercut pass. Defines a group n of signals that all must be driven together:

primitive std_mem_d2<"static"=1>[WIDTH, D0_SIZE, D1_SIZE, D0_IDX_SIZE, D1_IDX_SIZE](
  @write_together(2) addr0: D0_IDX_SIZE,
  @write_together(2) addr1: D1_IDX_SIZE,
  @write_together(1) write_data: WIDTH,
  @write_together(1) @go write_en: 1,
  ...
) -> (...);

This defines two groups. The first group requires that write_en and write_data signals together while the second requires that addr0 and addr1 are driven together.

Note that @write_together specifications cannot encode implication of the form "if port x is driven then y should be driven".

read_together(n)

Used by the papercut pass. Defines a group n in which when the read port is used then all the write ports must be driven as well.

primitive std_mem_d1<"static"=1>[WIDTH, SIZE, IDX_SIZE](
  @read_together(1) addr0: IDX_SIZE, ...
) -> (
  @read_together(1) read_data: WIDTH, ...
);

This requires that when read_data is used then addr0 must be driven. Note that each group must have exactly one output port in it.

Emitting Calyx from Python

Our frontends are written in Python 3 and make use of the calyx library to generate their code.

To install the library, run the following from the repository root (requires Flit):

cd calyx-py && flit install -s

The library provides an example:

python calyx-py/test/example.py

Building a Frontend for Calyx

In this tutorial, we're going to build a compiler for a small language called MrXL.

MrXL Overview

MrXL provides constructs to create arrays, and perform map and reduce operations on those arrays. Here's an example of a dot product implementation in MrXL:

input avec: int[4]
input bvec: int[4]
output dot: int
prodvec := map 1 (a <- avec, b <- bvec) { a * b }
dot := reduce 1 (a, b <- prodvec) 0 { a + b }

A map expression produces a new vector, with each element an evaluated expression that can use elements of other vectors. In the above example, the map expression multiplies the values of avec and bvec. These expressions also have parallelism factors: the map expression above has a parallelism factor of 1; a factor of 16 would stamp out 16 multipliers to speed up the computation.

reduce expressions walk over memories and accumulate a result into a register. In the above code snippet, we add together all of the elements of prodvec and place the sum in a register named dot.

Arrays can be input arrays, which we populate with some input data, or output arrays, which the program will populate for us.

Run a MrXL Program

First, you'll need to have the MrXL stage installed for fud. See the MrXL docs.

MrXL programs need to be fed data in the form of JSON files. Let's try to run this program, which has a parallelism factor of two:

input foo: int[4]
output baz: int[4]
baz := map 2 (a <- foo) { a + 5 }

with this data file, containing banked memories to allow for the parallelism:

{
  "foo_b0": {
    "data": [
      1,
      2
    ],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
    }
  },
  "foo_b1": {
    "data": [
      3,
      4
    ],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
    }
  },
  "baz_b0": {
    "data": [
      0,
      0
    ],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
    }
  },
  "baz_b1": {
    "data": [
      0,
      0
    ],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 32
    }
  }
}

Run the program with the supplied data by typing:

fud exec frontends/mrxl/test/add.mrxl --from mrxl --to vcd -s verilog.data frontends/mrxl/test/add.mrxl.data

Build a Compiler for MrXL

This guide will walk you through the steps to build a Python program that compiles MrXL programs to Calyx code. To simplify things, we'll make a few assumptions:

  • Every array in a MrXL program has the same length.
  • Every integer in our generated hardware will be 32 bits.
  • Every map and reduce body will be a single addition or multiplication involving array elements or integer constants.

The following sections will outline these two high-level tasks:

  1. Parse MrXL into a representation we can process with Python
  2. Generate Calyx code

Parse MrXL into an AST

To start, we'll parse MrXL programs into a Python AST representation. We chose to represent AST nodes with Python dataclasses. Our toplevel AST node looks like this:

@dataclass
class Prog:
    decls: List[Decl]
    stmts: List[Stmt]

Decl nodes correspond to array declarations like input avec: int[1024], and carry data about whether they're an input or output array, their name, and their type:

@dataclass
class Decl:
    input: bool  # Otherwise, output.
    name: str
    type: Type

Stmt nodes represent statements such as dot := reduce 4 (a, b <- prodvec) 0 { a + b }, and contain nested nodes representing their function header, body, and type of operation.

Now we can decide on rules for generating code for each AST node. Depending on the node, we might need to add code to the cells, wires, or control sections.

Generate Calyx Code

The skeleton of a Calyx program has three sections, and looks like this:

component main() -> () {
  cells { }
  wires { }
  control { }
}

cells contains declarations for logical hardware units like adders, memories and registers. wires contains groups that connect together the units declared in cells and form the structure of the hardware. control contains the logic specifying when the groups will perform their computation. Walking the nodes of the AST we defined earlier, we'll generate strings that we'll insert into each of these sections. The next few sections will discuss the different node types.

Decl nodes

Decl nodes instantiate new memories and registers. We need these to be instantiated in the cells section of our Calyx output. Here's Calyx code that creates a new memory foo, with 4 32-bit elements and a 32-bit index port:

foo = std_mem_d1(32, 4, 32);

For each Decl node, we need to determine if we're instantiating a memory or a register, and then translate that to a corresponding Calyx declaration and place it inside the cells section of our generated program. Here's some code from our compiler that records the array size we'll assume and collects the banking factor for each array produced by a map statement:

# size that we'll assume for the rest of the program's arrays.
arr_size = None

# Collect banking factors.
name2par = dict()
for stmt in prog.stmts:
    if isinstance(stmt.op, ast.Map):
        name2par[stmt.dest] = stmt.op.par

(emit_mem_decl emits a string of the form "mem_name = std_mem_d1(<element_width>, <num_elements>, <index_width>)".)

Map and Reduce nodes

For every map or reduce node, we need to generate Calyx code that iterates over an array, performs some kind of computation, and stores the result of that computation. For map operations, we perform a computation on each element of an input array and store the result in a result array. For reduce operations, we also read elements of an input array, but we combine each of them with an accumulator register, which both feeds into the computation and receives its result. For example, a reduce that sums the elements of an input array would initialize the accumulator register to 0 and then add each element of the input array to its value.

We can implement these behaviors using Calyx groups:

  • incr_idx: Increments an idx register using an adder. This group is done when the idx register is written to.
  • cond: Applies a "less than" operator to idx and the length of our input arrays, using the le hardware unit.
  • eval_body: Reads from an array, performs some kind of computation, and writes the result of the computation to an accumulator register or another array.

We'll make these groups for each Map and Reduce node, so to avoid naming collisions, we'll suffix each group with an integer, starting at 0 and incrementing each time we add a new set of groups. These groups will be added to the wires section. We'll also need to add logic to the control section that uses these groups to process arrays:

while le0.out with cond0 {
  seq { eval_body0; incr_idx0; }
}

This logic orchestrates our groups: it iterates over the array, evaluating the computation on each element. On each iteration we signal for the eval_body0 group to do its work, followed sequentially by incr_idx0, which advances the index register so we can work on the next element of the array.

Add Parallelization

MrXL allows you to parallelize your map and reduce operations. Let's revisit the map example from earlier:

input foo: int[4]
output baz: int[4]
baz := map 4 (a <- foo) { a + 5 }

The number 4 after map specifies the number of adders we can use at once to parallelize this computation. There are a few ways to parallelize this program; one of them is to split the memories used in the map operation into 4 separate memory banks, so that we can read from each bank of foo and write into each bank of baz simultaneously. In general, we can break a memory of size m into b banks (each of size m/b) and then process those b banks simultaneously. Realizing this in Calyx means creating a separate memory for each bank and creating groups to process each bank. Here's the section of the compiler that generates banked memories:

from calyx.py_ast import (
    Stdlib, Cell, CompVar,
    ConstantPort, HolePort, CompPort, Enable, While, ParComp, CombGroup)


def emit_mem_decl(name, size, par):
    """
    Returns N memory declarations,
    where N = `par`.
    """
    stdlib = Stdlib()
    banked_mems = []
    for i in range(par):
        banked_mems.append(
            Cell(
                CompVar(f"{name}_b{i}"),
                stdlib.mem_d1(32, size // par, 32),
            )
        )
    return banked_mems

In the Map and Reduce code generation section we described groups that can be orchestrated to iterate over a memory and process it. We now have to do that for each memory bank, and then parallelize these operations in the control section of the generated Calyx. We can accomplish this with Calyx's par keyword, which signals that groups should execute in parallel. Here's an example of executing four while loops in parallel:

par {
  while le_b0.out with cond_b0 { seq { eval_body_b0; incr_idx_b0; } }
  while le_b1.out with cond_b1 { seq { eval_body_b1; incr_idx_b1; } }
  while le_b2.out with cond_b2 { seq { eval_body_b2; incr_idx_b2; } }
  while le_b3.out with cond_b3 { seq { eval_body_b3; incr_idx_b3; } }
}
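
For concreteness, here's a sketch of how a compiler could emit such a par block for a given banking factor, using plain string templating (the real compiler builds calyx-py AST nodes like ParComp instead; the function name here is hypothetical):

def emit_par_control(par: int) -> str:
    # One while-loop per bank; the `_b<i>` suffixes match the
    # group names generated for each memory bank.
    loops = "\n".join(
        f"  while le_b{i}.out with cond_b{i} "
        f"{{ seq {{ eval_body_b{i}; incr_idx_b{i}; }} }}"
        for i in range(par)
    )
    return f"par {{\n{loops}\n}}"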

Conclusion

Hopefully this is enough to get you started writing your own MrXL compiler. Here are some follow-up tasks to try if you're interested:

  • Implement code generation for reduce statements, which our compiler does not include.
  • Implement code generation for memories that differ from one another in size.
  • Support more complex function body expressions. We only support binary operations with simple operands, like a + 5. Note that different hardware components take different numbers of cycles to execute: for example, writing to a register takes 1 cycle, but writing to a memory might take more. This complicates hardware design, as you need to account for differing latencies among hardware components.

Frontend Compilers

Several compilers generate Calyx programs from other high-level languages.

Dahlia

Dahlia is an imperative, loop-based programming language for designing hardware accelerators.

Installation

First, install sbt and scala.

Then, clone the repository and build the Dahlia compiler:

git clone https://github.com/cucapra/dahlia.git
cd dahlia
sbt assembly
chmod +x ./fuse

The Dahlia compiler can be run using the ./fuse binary:

./fuse --help

Finally, configure fud to use the Dahlia compiler:

fud config stages.dahlia.exec <path to Dahlia repository>/fuse

Use fud to check if the compiler was installed correctly:

fud check

fud should report that the Dahlia compiler is available and has the right version.

If something went wrong, try following the instructions to build the Dahlia compiler from its repository.

Compiling Dahlia to Calyx

Dahlia programs can be compiled to Calyx using:

fud e --from dahlia <input file> --to futil

The Dahlia backend for Calyx is neither complete nor stable. If you find a confusing error or a miscompiled program, please open an issue.

Systolic Array

Systolic arrays are commonly used to implement fast linear-algebra computations. See this paper for an overview of systolic arrays.

The systolic array frontend lives in the systolic-lang folder in the Calyx repository and generates systolic arrays that can perform matrix multiplies.

The gen-systolic.py script contains the entire program required to generate systolic arrays. To generate an 8x8 systolic array, run:

./frontends/systolic-lang/gen-systolic.py -tl 8 -td 8 -ll 8 -ld 8

Installation

Install the calyx-py library.

Command Line Options

The command line options configure the dimensions of the generated systolic array. There are no other properties of the systolic array that can be configured.

  • --top-length, --left-length: The lengths of the top and left sides of the systolic array.
  • --top-depth, --left-depth: The lengths of the input streams from the top and left sides of the array.

TVM Relay

TVM is a compiler for machine learning frameworks that can optimize and target kernels to several different backends. Relay is a high-level intermediate representation for the TVM framework, designed to replace older computation-graph-based IRs with a more expressive IR. More information can be found in this paper.

The TVM Relay frontend lives in the relay-lang folder in the Calyx repository and generates Calyx components from the Relay intermediate representation.

Installation

  1. Clone the TVM repository and check out commit ccacb1ec1:

     git clone --recursive git@github.com:apache/incubator-tvm.git
     cd incubator-tvm && git checkout ccacb1ec1
    
  2. Set up to build (the default configuration is fine because we don't need any fancy backends like LLVM or CUDA):

     mkdir build && cd build
     cp ../cmake/config.cmake .
    
  3. Build TVM:

     cmake -G Ninja .. && ninja
    
  4. Install the tvm Python package by building a wheel:

     cd ../python && python3 setup.py bdist_wheel
     pip3 install --user dist/tvm-*.whl
    
  5. Install the accompanying topi Python package:

     cd ../topi/python && python3 setup.py bdist_wheel
     pip3 install --user dist/topi-*.whl
    
  6. Install ANTLR v4.7.2 (required for the Relay text format parser):

     pip3 install -Iv antlr4-python3-runtime==4.7.2
    
  7. To run the MLP net and VGG net examples, install pytest:

     pip3 install pytest
    
  8. Install Dahlia, which is used when lowering Relay call nodes to Calyx.

  9. Install the calyx-py library.

Run an Example

Try this to run a simple example:

cd calyx/frontends/relay
python3 example.py tensor_add

  • -h: Help option; shows available examples.
  • -r: Dumps the Relay IR. Otherwise, it dumps the Calyx output.

Simulate an ONNX Model

A simple script is provided to run an Open Neural Network Exchange (ONNX) model. In addition to installing TVM Relay above, you'll need the following pip packages for ONNX simulation and image pre-processing:

pip3 install opencv-python Pillow mxnet onnx simplejson

For example, we can simulate the LeNet ONNX model found here using the following command:

python3 frontends/relay/onnx_to_calyx.py \
  -n "lenet" \
  -d "MNIST" \
  -i "/path/to/image.png" \
  -onnx "/path/to/model.onnx" \
  -o calyx

  • -n: The name of the input net. This is mostly used for naming the output files.
  • -d: The dataset against which the input will be classified, e.g. "mnist" or "imagenet". This determines what preprocessing is done on the image.
  • -i: The file path to the input image which you want classified.
  • -onnx: The file path to the ONNX model.
  • -o: The type of output.
    1. tvm: Executes the ONNX model using the TVM executor. Prints the final softmax value to console. No postprocessing is conducted.
    2. relay: Output a file with the corresponding Relay program. <net_name>.relay
    3. calyx: Output a .data file and Calyx program for simulation. <net_name>.futil, <net_name>.data
    4. all: All the above.

Number Theoretic Transform (NTT)

The number theoretic transform is a generalization of the fast Fourier transform that uses an nth primitive root of unity drawn from a quotient ring instead of the field of complex numbers.

It is commonly used to speed up computer arithmetic, such as the multiplication of large integers and large degree polynomials. The pipeline produced here is based upon this paper, which also provides some background information on NTT.

The NTT pipeline frontend lives in the ntt folder in the Calyx repository and generates the pipeline for the NTT transform.

The gen-ntt-pipeline.py file contains the entire program required to generate NTT pipelines. To generate a pipeline with bit width 32, input size 4, and modulus value 97, run:

./frontends/ntt-pipeline/gen-ntt-pipeline.py -b=32 -n=4 -q=97

Installation

Install the calyx-py library.

The generator also produces a table illustrating which operations occur during each stage of the pipeline. This requires PrettyTable and numpy:

pip3 install prettytable numpy

Fud Stage

The NTT pipeline defines an external fud stage to transform configuration files into Calyx programs. To install, run:

fud register ntt -p frontends/ntt-pipeline/fud/ntt.py && fud check

This should report the newly installed ntt stage in the configuration.

Configuration Files

Configuration files simply specify command line parameters:

{
  "input_bitwidth": 32,
  "input_size": 4,
  "modulus": 97
}

Command Line Options

The command line options configure the bit width, size, modulus value, and parallel reduction factor of the pipeline.

  • --input_bitwidth: The bit width of each value in the input array.
  • --input_size: The length (or size) of the input array.
  • --modulus: The (prime) modulus value used during the transformation.
  • --parallel_reduction: Decreases fan-out by reducing the number of groups executed in parallel by this factor.

MrXL

The MrXL frontend is a toy frontend developed for the frontend tutorial. As such, it is less rigorously tested and might have bugs.

MrXL is an example DSL for demonstrating Calyx. MrXL programs consist of map and reduce operations on arrays. For example, this is a dot product implementation:

input avec: int[1024]
input bvec: int[1024]
output dot: int
prodvec := map 16 (a <- avec, b <- bvec) { a * b }
dot := reduce 4 (a, b <- prodvec) 0 { a + b }

The numbers that come right after map and reduce are parallelism factors that guide the generation of hardware.

Install

Install the calyx-py library.

The MrXL implementation is in Python and uses Flit. First, install flit (pip install flit or similar), and then run the following inside the frontends/mrxl directory:

flit install --symlink

This creates a symbolic link to the mrxl directory and installs the mrxl command line tool.

By default, fud looks for the mrxl executable to enable the mrxl compilation stage. Type fud check to make sure fud reports that the mrxl compiler has been found.

Interpreter

To run the interpreter, do this:

mrxl <program> --data <indata> --interpret

where <program> is a MrXL source code file and <indata> is a JSON file containing values for all the variables declared as input in the program. The interpreter dumps the output variables as JSON to stdout.

You can try this, for example:

mrxl test/dot.mrxl --data test/dot.json --interpret

Compiling to Calyx

To run the compiler, leave off the --interpret and --data flags:

mrxl test/dot.mrxl

In order to run the compiler through fud, pass the --from mrxl flag:

fud e --from mrxl <mrxl file> --to futil

To simulate the Verilog generated by the mrxl compiler, pass -s verilog.data as usual:

fud e --from mrxl <mrxl file> --to dat -s verilog.data <data file>

Optimizations

This section documents Calyx optimizations that have been implemented.

Dataflow Optimizations

In general, dataflow analysis uses the control and data flow of a program to compute various properties (liveness, reaching definitions, ...) at each point in a program.

For Calyx, dataflow analyses use the explicit control program and knowledge about the dataflow of each group to compute properties about each group.

Basic blocks vs. Groups

Normally, dataflow analyses compute a property at each basic block of a control flow graph (CFG). Calyx doesn't have a notion of basic blocks, so we instead compute a property at each group in a program.

Because Calyx separates the control flow of a program from the specification of groups, it's possible for a group to appear multiple times in the control program. For this reason we compute a property at each group enable rather than each group definition. The property at each group definition can easily be computed as the meet over all group enables.
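
As a sketch: given the property already computed at every enable of a group and a meet operator for the chosen lattice (both of which are assumptions here), the definition-level property is just a fold:

from functools import reduce

def group_property(enable_props, meet):
    # Meet the per-enable properties into a single property
    # for the group's definition.
    return reduce(meet, enable_props)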

Dataflow on an AST

Dataflow analyses are typically performed by finding the fixed point of a set of equations defined at each node of a control flow graph (CFG) using the worklist algorithm.

Because our control AST is little more than the edges of a reducible CFG, we don't bother building an explicit CFG; instead, we perform the dataflow analysis directly on the AST using Calyx's visitor infrastructure.

Abstract Algorithm

We model each control statement s as a function f: p -> p, where p is the type of the property. Control statements that have children define how information flows between their children.

Enable

f for enable A is similar to the transfer function in standard dataflow analysis. It uses information from the definition of group A to modify the input. For example, if p is the set of live variables, f for an enable is defined as:

f(enable A, inputs) = (inputs - kill(A)) | gen(A)
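
A direct Python translation, assuming gen and kill sets have been precomputed per group:

def transfer_enable(group, inputs, gen, kill):
    # Everything live after the enable, minus what group A
    # writes, plus what it reads.
    return (inputs - kill[group]) | gen[group]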

Seq

seq defines sequential control flow edges between its children. It is implemented by threading its input through all of its children to produce an output.

f(seq { A; B; C; ...; Z; }, inputs) =
     f(A, inputs)
  |> f(B, _)
  |> f(C, _)
  |> ...
  |> f(Z, _)

To implement a backwards dataflow analysis, all you need to do is reverse the order in which seq pipes inputs to its children:

// reverse
f(seq { A; B; C; ...; Z; }, inputs) =
     f(Z, inputs)
  |> ...
  |> f(C, _)
  |> f(B, _)
  |> f(A, _)
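
Both directions amount to folding the transfer function over the children. A sketch, with f standing for the per-statement transfer function:

from functools import reduce

def transfer_seq(children, inputs, f, backwards=False):
    # Thread the property through each child in order, feeding each
    # output into the next child; reverse for a backwards analysis.
    order = reversed(children) if backwards else list(children)
    return reduce(lambda prop, child: f(child, prop), order, inputs)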

If

if passes its inputs to its condition group and then feeds the result to both of its children. The output is the union of the outputs of the two children. This is standard.

f(if some.port with G { True; } else { False; }, inputs) =
  f(True, f(G, inputs)) | f(False, f(G, inputs))
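
The same rule as a Python sketch:

def transfer_if(cond_group, true_child, false_child, inputs, f):
    # Both branches see the property after the condition group;
    # the result is the union of the two branch outputs.
    after_cond = f(cond_group, inputs)
    return f(true_child, after_cond) | f(false_child, after_cond)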

While

while statements are interesting because the outputs of the body may affect the input to the body. For this reason, we need to find a fixed point:

f(while some.port with G { body; }, inputs) =
  P = inputs;
  loop until P stops changing {
    P = f(body, f(G, P))
  }
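
The same fixed-point computation as a Python sketch:

def transfer_while(cond_group, body, inputs, f):
    # Iterate the body's transfer function until the property
    # stops changing.
    prop = inputs
    while True:
        new_prop = f(body, f(cond_group, prop))
        if new_prop == prop:
            return prop
        prop = new_prop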

Par

Par is the only statement that differs substantially from traditional dataflow analysis, because control flow graphs don't support nodes running in parallel: there is only ever one thing executing at a time. par changes this by allowing multiple things to execute simultaneously. To see why this is strange, consider the following example, where we are computing the liveness of x:

F; // wr x
...
par {
  A; // wr x
  B;
}
G; // rd x

Is x alive between F and the beginning of par? The answer is no: we know that both A and B will run, so the write to x in F can never be seen by any group.

At first glance this doesn't seem like a problem. Great, we say, we can just take the union of the outputs of the children of par and call it a day.

This is wrong because B doesn't kill x and so x is alive coming into B. The union preserves this information and results in x being alive above par.

Taking the set intersection is not quite right here either. Consider adding another group C that reads from x.

We have no information about how this read is ordered with the write to x in A so we have to assume that x is alive above par. If we take the intersection here:

  live(A) & live(B) & live(C)
= {} & {x} & {x}
= {}

We get the wrong answer. More generally, we can see that union clobbers any writes and intersection clobbers any reads that happen in the par.

This problem is solved by passing the gen and kill sets along with the live sets. par can then set its output to be:

(union(live(children)) - union(kill(children))) | union(gen(children))
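
In Python, with each child contributing its live, gen, and kill sets:

def transfer_par(live_sets, gen_sets, kill_sets):
    # (union(live) - union(kill)) | union(gen)
    live = set().union(*live_sets)
    kill = set().union(*kill_sets)
    gen = set().union(*gen_sets)
    return (live - kill) | gen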

The final tricky bit is thinking about how information flows between siblings in a par statement. Consider again the picture above with three nodes: A, B, and C. Should x be live at B? For liveness the answer turns out to be yes, but bear with me for a moment for a thought experiment: consider the case where we have the guarantee that statements running in parallel cannot interact with each other. This would let us reorder statements in some cases. Then there seems to be an information trade-off in how to define the liveness of x at B:

  • You could say that x is dead at B because B doesn't see any previous writes to x and doesn't read from x. This implies that you could potentially replace writes to other registers with writes to x. However, this by itself would cause x to be written to twice in parallel; you would have to reorder B to run before A. The takeaway is that calling x dead at B gives the register-reuse pass more information to work with. To compute this information, B needs the gens and kills from all of its siblings (for the same reason that par needed them). This is not particularly hard to implement, but it's worth noting.
  • If you say that x is live at B, then you can never rewrite B to use x instead of some other register, but you also don't have to worry about reordering statements in a par.

Leaving thought-experiment land: in our case we can never reorder statements in a par because siblings may interact with each other in arbitrary ways. For this reason, we must say that x is live at B. However, it's not immediately clear to me that this will be true of all dataflow analyses; that's why I set this thought experiment down in writing.

Equivalence to worklist algorithm

In the normal worklist algorithm, we add a statement back to the worklist when its predecessor has changed.

The intuition for why our approach is equivalent to the worklist algorithm is that the only entry point into the children of each parent control statement is the parent itself. The only way a child statement would need to be recomputed is if the inputs to its parent change, and anything above the parent in the tree takes care of that re-computation.

Debugging Tips

Debugging Calyx programs that run to completion and generate the wrong value can be challenging. We can first try to eliminate some common causes of problems.

Disabling Optimizations

The first step is disabling optimization passes. The Calyx compiler refers to bundles of optimizations using two aliases: pre-opt and post-opt. pre-opt passes run before the main compilation passes that remove the control program while post-opt passes run after.

To disable these passes, add the flags -d pre-opt -d post-opt to the compiler invocation:

  1. For the compiler: futil <filename> -d pre-opt -d post-opt.
  2. For fud: fud ... -s futil.flags "-d pre-opt -d post-opt".

If the execution now generates the right result, then one of the optimization passes is incorrect. To identify which pass is wrong, add back individual passes and see when the execution fails. To do so, first run futil --list-passes to see the names of the passes that make up the pre-opt and post-opt aliases. Next, re-enable passes one at a time: -d pre-opt -d post-opt -p <pass1> -p <pass2>, and so on.

Disabling Static Timing

static-timing is one of the two control compilation passes. It uses the latency information on groups to generate faster hardware. Disable it using the flag -d static-timing.

Reducing Test Files

It is often possible to reduce the size of the example program that is generating incorrect results. In order to perform a reduction, we need to run the program twice, once with a "golden workflow" that we trust to generate the right result and once with the buggy workflow.

For example, if we've identified the problem to be in one of the Calyx passes, the "golden workflow" runs the program without the pass, while the buggy workflow runs the program with the pass enabled. This case is so common that we've written a script that runs a program with different sets of flags to the Calyx compiler and shows the difference in the outputs after simulation.

The script is invoked as:

tools/flag-compare.sh <calyx program> <data>

By default, the script will try to run the programs by simulating them through Verilator by providing fud with the target --to dat. If you'd like to use the Calyx Interpreter instead, run the following command:

tools/flag-compare.sh <calyx program> <data> interpreter-out

Reducing Calyx Programs

The best way to reduce a Calyx program is to delete group enables from the control program and check whether the generated program still produces the wrong output. While doing this, make sure you're not deleting an update to a loop variable, which might cause an infinite loop.

By default, the compiler will complain if the program contains a group that is not used in the control program, which can get in the way of minimizing programs. To get around this, run the dead-group-removal pass before the validation passes:

futil -p dead-group-removal -p validate ...

Reducing Dahlia Programs

If you're working with Dahlia programs, it is also possible to reduce the program with the script since it simply uses fud to run the program with the simulator. As with Calyx reduction, try deleting parts of the program and seeing if the flag configurations for the Calyx program still generate different outputs.

Waveform Debugging

Waveform debugging is the final way of debugging Calyx programs. A waveform captures the value of every port at every clock cycle and can be viewed with a waveform viewer such as GTKWave or WaveTrace. Because of this level of granularity, a waveform contains a lot of information. To make it more digestible, we can use information generated by Calyx during compilation.

For waveform debugging, we recommend disabling the optimization passes and static timing compilation (unless you're debugging these passes). In this debugging strategy, we'll do the following:

  1. Dump out the control FSM for the program we're debugging.
  2. Find the FSM states that enable the particular groups that might be misbehaving.
  3. Open the waveform viewer and find clock cycles where the FSM takes the corresponding values and identify other signals that we care about.

Consider the control section from examples/futil/dot-product.futil:

    seq {
      let0;
      while le0.out with cond0 {
        seq {
          par {
            upd0;
            upd1;
          }
          let1;
          let2;
          upd2;
          upd3;
        }
      }
    }

Suppose that we want to make sure that let0 is correctly performing its computation. We can generate the control FSM for the program using:

  futil <filename> -p top-down-cc

This generates a Calyx program with several new groups. We want to look for groups with the tdcc prefix, which look something like this:

group tdcc {
  let0[go] = !let0[done] & fsm.out == 4'd0 ? 1'd1;
  cs_wh.in = fsm.out == 4'd1 ? le0.out;
  cs_wh.write_en = fsm.out == 4'd1 ? 1'd1;
  cond0[go] = fsm.out == 4'd1 ? 1'd1;
  par[go] = !par[done] & cs_wh.out & fsm.out == 4'd2 ? 1'd1;
  let1[go] = !let1[done] & cs_wh.out & fsm.out == 4'd3 ? 1'd1;
  let2[go] = !let2[done] & cs_wh.out & fsm.out == 4'd4 ? 1'd1;
  upd2[go] = !upd2[done] & cs_wh.out & fsm.out == 4'd5 ? 1'd1;
  upd3[go] = !upd3[done] & cs_wh.out & fsm.out == 4'd6 ? 1'd1;
  ...
}

The assignments to let0[go] indicate what conditions make the let0 group execute. In this program, we have:

let0[go] = !let0[done] & fsm.out == 4'd0 ? 1'd1;

This states that let0 will be active when the state of the fsm register is 0 (along with some other conditions). The remainder of the group defines how the state in the fsm register changes:

  ...
  fsm.in = fsm.out == 4'd0 & let0[done] ? 4'd1;
  fsm.write_en = fsm.out == 4'd0 & let0[done] ? 1'd1;
  fsm.in = fsm.out == 4'd1 & cond0[done] ? 4'd2;
  fsm.write_en = fsm.out == 4'd1 & cond0[done] ? 1'd1;
  ...

For example, we can see that when the value of the FSM is 0 and let0[done] becomes high, the FSM will take the value 1.

Once we have this information, we can open the VCD file and look at points when the fsm register has the value 1 and check to see if the assignments in let0 activated in the way we expected.

Contributors

Here is a list of all the people who have worked on Calyx:

Current Contributors

Previous Contributors

If you're missing from this list, please add yourself!