Final Project
Due Dates
For the final project, you’ll be completing the following deliverables:
- Proposal: Due Friday, November 15, 11:59pm (Gradescope)
- Checkpoint 1: Due Friday, November 22, 11:59pm
- Checkpoint 2: Due Wednesday, November 27, 11:59pm
- Checkpoint 3: Due Friday, December 6, 11:59pm
- Presentation: Takes place Tuesday, December 10, 2:00pm-4:00pm
- Report: Due Tuesday, December 10, 11:59pm
The document below provides more details on what you’ll be turning in.
Overview
Now that we’ve wrapped up lab assignments for the semester, we’ve set aside these last several weeks of 6.S894 for you to work on an open-ended project on a topic of your choice. This final project is meant to give you an opportunity to exercise what you’ve learned, to learn more about topics we didn’t have a chance to cover deeply in the labs, and to get a better sense of what software engineering for GPUs and other accelerators looks like in the real world. We hope you end up having fun with it!
Here’s an overview of final project logistics:
- Teams:
  - Students should complete final projects in teams of 2-4 people.
- Topic Selection:
  - Every team gets to choose their own topic to work on.
  - We provide some suggestions later in this document, but you’re allowed – and encouraged – to pursue ideas which aren’t on that list.
  - Project topics should be challenging, but not (necessarily) novel. A final project focused on reimplementing something which already exists is totally okay.
  - There will be a lightweight topic proposal and approval process, so that you get a chance to check in with the course staff before you start working on your project.
- Hardware and Tools:
  - We’ll continue to provide access to an NVIDIA RTX A4000 GPU which you can use to complete your final projects.
    - You’ll have more flexible access to this GPU than you’ve previously had through Telerun. You’ll be able to submit multi-file projects, provide your own input data files, control the build process, etc.
  - Additionally, you can use any other hardware you have access to.
    - You’re welcome to use any non-NVIDIA accelerators that you have access to! For example, you could do a project involving FPGAs, or AMD GPUs, or various Apple Silicon accelerators. The key constraints are that the course staff won’t be able to provide the same level of support, and everyone on your team should have access to the hardware to be able to contribute.
  - You’re welcome to explore using higher-level CUDA libraries, as well as software tools beyond CUDA. E.g. you might use CUB, CUTLASS, or Triton. (See the sketch after this list for a small taste of what using one of these libraries looks like.)
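To give a taste of what using a higher-level library looks like, here is a minimal sketch of a device-wide sum using CUB. This is purely illustrative; the function and buffer names are our own and nothing here is course-provided code. CUTLASS and Triton play a similar "building blocks" role for matrix-multiply-like workloads.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Minimal sketch: sum `num_items` floats already resident on the GPU using
// CUB's DeviceReduce. (Illustrative only; names and error handling omitted.)
void sum_with_cub(const float *d_in, float *d_out, int num_items) {
    void *d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: with a null temp-storage pointer, CUB only computes how
    // much scratch space it needs and writes it to temp_storage_bytes.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: actually run the reduction.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}
```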
Part 1: Topic Selection
You’re welcome to propose any final project topic you like related to accelerated computing, and the course staff will work with you during the approval process to make sure the scope of your project is appropriate.
Final project topics should be significantly more challenging than a lab assignment, and we’ll expect larger teams to take on somewhat more ambitious projects than smaller teams.
Example
As an example of how you might choose your final project topic, you could start with an idea like:
Core idea (good start; not sufficient on its own):
“We could implement the forward pass of FlashAttention.”
You could then come up with possible extensions to this core idea which would make it more interesting:
Possible extensions:
“We could implement both the forward and backward passes of FlashAttention.”
“We could aim to achieve 90% of the performance of the best FlashAttention implementation we can find online.”
“We could aim to efficiently support irregular and dynamic attention masks.”
“We could implement a version of FlashAttention using a technique like PagedAttention to more efficiently handle batches with irregular sequence lengths.”
“We could add support for low-precision (e.g. 4-bit-quantized) inputs with special techniques to preserve accuracy, as in this paper.”
“We could integrate our FlashAttention implementation with PyTorch and measure how it affects the end-to-end performance of a transformer implementation.”
“We could explore opportunities for fusing our FlashAttention implementation with other nearby operations in a transformer layer, such as a rotary position embedding (RoPE) operator.”
The appropriate scope for your project depends on the size of your team:
- For a team of 2 people, choosing 2-3 extensions from the above list would correspond to the right level of difficulty.
- For a team of 3-4 people, choosing 3-4 extensions from the above list would correspond to the right level of difficulty.
(Note: If the above FlashAttention ideas sound interesting, you’re welcome to actually pursue this for your final project if you want!)
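To make the core idea above concrete, here is a deliberately naive sketch of a single-head attention forward pass, O = softmax(Q K^T / sqrt(d)) V, in plain CUDA. It is the kind of unoptimized correctness baseline a team might compare against, not a staff-provided reference, and the kernel name and layout conventions are our own for this example.

```cuda
#include <cuda_runtime.h>
#include <float.h>

// Naive single-head attention forward pass: one thread per query row.
// Q, K, V, and O are all [seq_len, head_dim], row-major, fp32.
// (Hypothetical baseline only; a real FlashAttention kernel tiles Q/K/V
// through shared memory and uses an online softmax instead of three passes.)
__global__ void naive_attention_forward(const float *Q, const float *K,
                                        const float *V, float *O,
                                        int seq_len, int head_dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query index
    if (i >= seq_len) return;

    float scale = rsqrtf((float)head_dim);

    // Pass 1: row max of the scores, for a numerically stable softmax.
    float row_max = -FLT_MAX;
    for (int j = 0; j < seq_len; j++) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; d++)
            s += Q[i * head_dim + d] * K[j * head_dim + d];
        row_max = fmaxf(row_max, s * scale);
    }

    // Pass 2: softmax denominator.
    float denom = 0.0f;
    for (int j = 0; j < seq_len; j++) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; d++)
            s += Q[i * head_dim + d] * K[j * head_dim + d];
        denom += expf(s * scale - row_max);
    }

    // Pass 3: accumulate softmax(Q K^T) V into the output row.
    for (int d = 0; d < head_dim; d++)
        O[i * head_dim + d] = 0.0f;
    for (int j = 0; j < seq_len; j++) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; d++)
            s += Q[i * head_dim + d] * K[j * head_dim + d];
        float p = expf(s * scale - row_max) / denom;
        for (int d = 0; d < head_dim; d++)
            O[i * head_dim + d] += p * V[j * head_dim + d];
    }
}
```

A real FlashAttention implementation replaces the repeated passes over K and V with a single tiled pass using an online softmax; measuring how far such an implementation pulls ahead of a baseline like this (and of the best existing implementation you can find) is exactly the kind of evidence the final report asks for.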
Proposals
Final project proposals should be submitted on Gradescope (link) in PDF format, and should contain the following parts:
- A list of team members.
- A brief description of your idea, in 1-2 paragraphs.
- An explanation of the resources you plan to use for your project.
Example
The following shows an example of a complete, successful final project proposal:
Final Project Proposal (Example)
Team Members: Alice, Bob, Carol
Description: We plan to implement a FlashAttention forward and backward pass at bfloat16 precision, targeting RTX A4000 GPUs. We will aim for both our forward and backward passes to achieve 90% of the throughput of the best FlashAttention implementation we can find online, evaluated at reasonable problem sizes. We will integrate our FlashAttention implementation with PyTorch and measure its effect on the end-to-end latency of a single training step for a transformer model.
Resources: We will develop and benchmark our kernels using the RTX A4000 GPU provided by the course. Additionally, Carol has access to an NVIDIA A6000 GPU through her lab, which we may use to run Nsight Compute when debugging the performance of our kernels.
Approval Process
After you submit your final project proposal, the course staff will try to get back to you within 1-2 days to either…
- Immediately approve your proposed topic, or…
- Work with you to refine the scope of your proposed topic.
If the course staff doesn’t immediately approve your proposal, we may try to schedule a meeting with members of your team to discuss ways your proposal could be extended or pared down. After the course staff has worked with you to develop a revised proposal, you can consider your proposal approved and can start working on your project.
All correctly-formatted project proposals submitted before the November 15 deadline will receive full credit for the proposal component of the final project, regardless of whether or not they are immediately approved.
Topic Suggestions
Although we encourage you to come up with your own ideas for final projects, we also have a list of topic areas you might find it helpful to consider when writing your proposals.
The topics we suggest can be roughly broken down into two categories:
- Performance engineering projects, where the goal is to develop an implementation of some workload which runs as fast as possible.
- Investigation / reverse-engineering projects, where the goal is to deeply understand the performance characteristics and microarchitectural details of the hardware. Because these details are often undocumented, this will likely require running a lot of carefully-controlled experiments to determine empirically how the hardware works.
This “performance engineering” / “investigation” distinction isn’t perfectly sharp, and it’s fine to do a project which straddles both categories.
With those categories in mind, here are some of the topic areas you may want to consider. These don’t necessarily all constitute complete project ideas; you may need to combine several ideas from this list, or elaborate on an idea with your own extensions, to arrive at a complete project proposal:
- Performance engineering:
  - Matrix-multiply-like workloads:
    - (See also: “Further Reading” in Lab 6.)
    - Achieve 90% of cuBLAS performance across a range of matrix multiply problem sizes.
    - Implement reduced-precision matrix multiply kernels.
    - Explore using higher-level libraries like CUTLASS (link) or ThunderKittens (link).
    - Explore using Triton (link).
      - Explore the implementation details of how Triton generates code for matrix-multiply-like workloads.
      - Explore modifying / extending the Triton compiler. (Ambitious!)
    - Explore advanced matrix multiply scheduling strategies like many-stage software pipelines (link) or “stream-k” partitioning (link).
    - Implement FlashAttention.
    - Implement sparse linear algebra workloads.
  - Graphics and simulation workloads:
    - Explore accessing the GPU via graphics APIs, e.g. Vulkan (link), rather than CUDA.
    - Implement a ray tracing renderer on NVIDIA RTX GPUs, using e.g. the OptiX API (link).
    - Develop fast pure-CUDA implementations of ray tracing, triangle mesh rendering, etc.
    - Explore differentiable rendering: NeRF (link), Gaussian splatting (link), etc.
    - Explore mesh processing on the GPU, as in the work of our guest lecturer Ahmed Mahmoud (link).
    - Implement a fluid or soft-body simulation using the material point method (link).
    - Implement a fluid dynamics simulation by e.g. numerically solving the Navier-Stokes equations (link), using lattice Boltzmann methods (link), etc.
    - Implement a finite-element-based PDE solver on a mesh (link).
  - Miscellaneous workloads:
    - Bioinformatics: e.g. DNA sequence alignment based on the Needleman-Wunsch algorithm (link).
    - Databases: e.g. parallelizing execution of join queries on the GPU, as in this paper.
    - Signal processing: e.g. implement a fast Fourier transform (link) on the GPU.
    - Pseudo-random number generation: explore methods of generating high-quality pseudo-random numbers on the GPU at high throughput; compare tradeoffs of different designs for different workloads.
    - Sorting: implement a fast sorting algorithm on the GPU. (See this survey.)
    - Graph processing: e.g. pathfinding, optimal partitioning. (See this survey.)
  - Advanced concurrency and synchronization:
    - Explore synchronization primitives available on the GPU. (See the sketch after this list for one example.)
    - Implement core concurrent data structures on the GPU:
      - Mutual exclusion locks, read-write locks
      - Concurrent FIFO queues
      - Concurrent hash tables
      - Concurrent dynamic memory allocators
        - Using built-in GPU-side malloc (link).
        - Using your own custom concurrent allocator.
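As one concrete example of the synchronization primitives mentioned above, here is a minimal sketch of a spinlock built from atomicCAS / atomicExch. This is hypothetical example code, not a recommended design: questions about fairness, forward progress under warp divergence, and exactly which fences the PTX memory model requires are precisely what make this a project-sized topic.

```cuda
#include <cuda_runtime.h>

// Minimal spinlock sketch. (Hypothetical example; lock == 0 means free,
// lock == 1 means held.)
__device__ void acquire_lock(int *lock) {
    while (atomicCAS(lock, 0, 1) != 0) { }  // spin until we win the CAS
    __threadfence();  // order the critical section after acquiring the lock
}

__device__ void release_lock(int *lock) {
    __threadfence();  // make our writes visible before releasing
    atomicExch(lock, 0);
}

// Example use: many blocks increment a counter inside the critical section.
// Only one thread per block touches the lock; naive per-thread spinlocks can
// livelock under warp divergence on some architectures.
__global__ void locked_increment(int *lock, int *counter) {
    if (threadIdx.x == 0) {
        acquire_lock(lock);
        volatile int *c = counter;  // volatile: bypass the non-coherent L1
        *c = *c + 1;
        release_lock(lock);
    }
}
```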
- Investigation / reverse engineering:
  - Study this third-party experimental report about the Volta microarchitecture, and run similar experiments to figure out how some of the corresponding details work on the newer Ampere architecture.
    - What is the banking configuration of the register file on Ampere?
    - Can you observe effects related to the existence of the “register reuse cache” described in the report?
    - What are the microarchitectural details of the L1 and L2 caches?
      - This includes: latency, sector size, tag lookup throughput, associativity, etc. (See the sketch after this block for one way to measure load latency.)
      - Can you observe effects related to the existence of the “L1.5 cache” described in the report?
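For the latency questions above, one classic measurement technique is a dependent pointer chase timed with the SM’s cycle counter. The sketch below is our own illustrative code (names and sizes are made up): by sweeping the footprint of the chased array you can watch the cycles-per-load step up as the working set falls out of L1, then L2, then into DRAM.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Dependent pointer chase: each load's address depends on the previous load,
// so latency cannot be hidden and the average cycles/load approximates the
// load-use latency of whichever cache level the footprint fits in.
__global__ void pointer_chase(const unsigned int *indices, int len, int iters,
                              long long *cycles_out, unsigned int *sink) {
    unsigned int idx = 0;

    // Warm up the cache with one pass over the chain.
    for (int i = 0; i < len; i++) idx = indices[idx];

    long long start = clock64();
    for (int i = 0; i < iters; i++) idx = indices[idx];
    long long stop = clock64();

    *cycles_out = stop - start;
    *sink = idx;  // keep the chase from being optimized away
}

int main() {
    const int len = 4096;      // footprint = len * 4 bytes; sweep this
    const int iters = 100000;

    // Build a simple stride-1 cycle on the host. (A random permutation is
    // better for defeating any prefetching.)
    unsigned int *h_idx = new unsigned int[len];
    for (int i = 0; i < len; i++) h_idx[i] = (i + 1) % len;

    unsigned int *d_idx, *d_sink;
    long long *d_cycles;
    cudaMalloc(&d_idx, len * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_idx, h_idx, len * sizeof(unsigned int), cudaMemcpyHostToDevice);

    // One thread, one block: we want a serial chain, not parallelism.
    pointer_chase<<<1, 1>>>(d_idx, len, iters, d_cycles, d_sink);

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("%.2f cycles per dependent load\n", (double)cycles / iters);
    return 0;
}
```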
  - Investigate details of threadblock scheduling:
    - E.g. what algorithm does the GPU use to assign blocks to SMs when different blocks take different amounts of time to complete?
  - Investigate aspects of the memory system:
    - Instruction cache: Investigate the performance characteristics of the instruction cache, identify cases where it is a bottleneck for performance, and explore mitigations. (link)
    - L1 cache:
      - Investigate how the microarchitecture of the L1 cache upholds the PTX memory consistency model (link) in the presence of synchronization instructions like atomics. Are cache lines in L1 ever defensively flushed/evicted to enforce memory consistency, and if so, when? Does the GPU appear to use any kind of cache coherence protocol to keep different SMs’ L1 caches in sync?
        - Note from course staff: we really hope someone will investigate this one! Please contact us if you’re interested and we’ll explain the precise problem in more detail to help you get started.
    - L1 as shared memory:
      - Can multiple warps within a block overlap to get full bandwidth by coalescing their requests into shared? E.g. can you get full shared memory bandwidth at double the latency with one warp accessing odd banks and another warp accessing even banks?
      - Can you experimentally confirm / refute the interaction between vectorized loads and shared memory banking which we speculatively described in the instructions for Lab 5?
    - L2 cache:
      - What’s the deal with all the different ways that CUDA allows programmers to configure the L2, regions of memory in L2, and accesses to L2? See this documentation. What are the implications of these features for developing high-performance CUDA kernels? (A sketch of one such knob, the L2 access policy window, follows this list.)
        - In particular, what’s the deal with the documentation saying that L2 caching policy is determined by RNG?
      - How do different SMs interact with each other and with L2 when accessing global memory?
        - How many SMs are necessary in order to saturate DRAM/L2 at peak bandwidth? Can you do it with just one SM, with all other SMs sitting idle? If not, what is the peak bandwidth achievable with just a single SM?
      - Do stores to DRAM always pay the full cost of loading from DRAM first, or can stores that fully overwrite a whole cache line pay only a single request’s worth of bandwidth cost?
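As a concrete example of the L2 configuration knobs referenced above, here is a minimal sketch of reserving a persisting region of L2 and attaching an access policy window to a stream; the hitRatio field (a fraction of accesses chosen at random to receive the hit property) is presumably the probabilistic behavior the documentation alludes to. Buffer names and sizes are placeholders, and whether and how this actually helps on an RTX A4000 is exactly the kind of thing an investigation project would measure.

```cuda
#include <cuda_runtime.h>

// Minimal sketch: reserve part of L2 for persisting accesses and tell a
// stream that accesses to d_lut should (probabilistically) persist there.
// (Hypothetical example; d_lut / lut_bytes are placeholders.)
void configure_l2_window(cudaStream_t stream, void *d_lut, size_t lut_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Set aside a portion of L2 for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                       prop.persistingL2CacheMaxSize);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr = d_lut;
    attr.accessPolicyWindow.num_bytes = lut_bytes;
    attr.accessPolicyWindow.hitRatio = 0.6f;  // fraction of accesses treated as persisting
    attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;

    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // ... launch kernels on `stream` that repeatedly read d_lut ...
}
```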
  - Investigate details of the GPU’s frequency scaling behavior.
    - Under what circumstances does the GPU scale up or scale down its clock frequency? How deterministic / reproducible is this behavior across multiple trials? What implications does this have for performance benchmarking?
  - Investigate aspects of the CUDA compiler and SASS:
    - How do non-inlined function calls work at the SASS level? What ABI does CUDA use for function calls?
    - Partially reverse-engineer the instruction encoding used in the SASS binary format.
      - Someone has done this for Hopper (Compute Capability 9.0): link.
      - As far as the course staff is aware, nobody has done this for our RTX A4000 GPU (Compute Capability 8.6).
    - Write a code generator or assembler which directly targets SASS, without going through PTX.
      - Even a code generator for a very specialized application could be interesting; e.g. a code generator which directly emits highly-optimized SASS corresponding to the inner loop of a matrix multiply.
      - Over a decade ago, someone built a SASS assembler for the Fermi microarchitecture (link). As far as the course staff is aware, nobody has done this for any NVIDIA GPU microarchitecture since then.
- Hardware-constrained projects:
  - (These are not possible using course-provided infrastructure alone; you’ll need to supply your own hardware.)
  - Using Hopper features: (If you have access to a Hopper GPU like an H100)
  - Distributed communication between GPUs: (If you have access to a machine with multiple GPUs, ideally with NVLink. Note that MIT’s Satori cluster may be a plausible resource here – it has older V100 GPUs, but there are 4 per node interconnected with NVLink and 100Gb Infiniband across nodes.)
    - Implement your own version of some or all of the “collective operations” from NVIDIA’s NCCL library (link) from scratch, e.g. AllGather, ReduceScatter, AllReduce. (For reference, a sketch of calling NCCL’s own AllReduce follows this list.)
    - Compare the performance of communication over PCIe vs NVLink interconnect.
    - Explore methods of overlapping cross-GPU communication with computation, including approaches based on fused kernels such as FLUX (link).
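If you implement collectives from scratch, NCCL itself is the natural baseline to benchmark against. Below is a minimal sketch of a single-process AllReduce across several GPUs using NCCL’s public API; the device count and buffer sizes are made up for illustration, and error checking is omitted.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int nDev = 4;               // number of GPUs in this node
    const size_t count = 1 << 20;     // elements per GPU
    int devs[nDev] = {0, 1, 2, 3};

    ncclComm_t comms[nDev];
    cudaStream_t streams[nDev];
    float *sendbuf[nDev], *recvbuf[nDev];

    // One communicator per device, all managed from a single process.
    ncclCommInitAll(comms, nDev, devs);

    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
    }

    // Sum-reduce the buffers across all GPUs; grouping the calls lets NCCL
    // launch them together without deadlocking.
    ncclGroupStart();
    for (int i = 0; i < nDev; i++) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```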
Part 2: Project Implementation
After your proposal is approved, you’ll have roughly the last three weeks of class available to work on your final project. During this time, the class will continue to meet in person during live lab to give you time to work as a team and to discuss your projects with course staff.
We’ll ask each student to submit three low-stakes checkpoint assignments on Gradescope to let us know how they’re doing with the final project. These checkpoints will be graded roughly the same way as lab checkpoints, and are mostly a way for the course staff to identify ways in which teams are stuck and to help them get un-stuck.
Infrastructure
We have released a new, more flexible version of Telerun for teams to develop their final projects. More information about this new course infrastructure is coming soon. Thanks for your patience!
Part 3: Final Presentation and Report
During live lab time on Tuesday, December 10, every team will present a brief overview of what they achieved in their final project. Due to time constraints, presentations will necessarily be short, with an anticipated time budget of exactly 5 minutes per presentation.
Additionally, on the same day as final presentations, your team will submit a final report including:
- All code to reproduce the experiments in your project.
- A write-up in PDF format describing:
  - The objectives you set for your final project.
    - In the case of a performance engineering project, any analysis you performed, microbenchmarks you ran, or baseline implementations you used to determine the performance targets you should try to hit. Since there won’t be a staff baseline for comparison, it will now be your job to provide evidence to convince us (and yourselves!) why the performance you reached is “good.”
  - The design of any code you wrote, including a discussion of your design process and any alternative designs you explored.
  - The results of any experiments you ran.
  - A discussion of your results, including any limitations of your implementation or experiments, and directions for future work.
  - A related work section covering existing publicly-available software, papers, blog posts, etc. relevant to your project.
We intend to share each final report with the whole class, so that any student who finds another team’s presentation interesting can learn about that team’s work in greater depth.