~/codewithstu

Building a High-Performance .NET AWS SDK


In 2025, I started to build Goa which is a high-performance, minimal-overhead framework for building AWS Lambda functions with native AOT support. I wanted to share some of the performance optimisation journey with you.

The project started during the AWS SDK v3.5 era, when I noticed that API calls from Lambda were slower than expected. I started out diagnosing the HTTP connection issues and random stalls we were seeing. Before long, I had a complete rewrite of the signature mechanism and a completely new Lambda runtime implementation for .NET. I've since continued this as a passion project and used it as the basis for my own projects to ensure it works for production use cases.

This post walks through the different performance techniques I used across four component areas, with real code from the repository. Whether or not you ever use Goa, these techniques apply to most .NET projects where allocations matter. The official AWS SDK is well-designed and battle-tested, but it has to be everything for everyone, which makes the SDK team's job a lot harder. The Goa benchmarks in this post are against AWSSDK.Core v4.0.3.6, the latest at the time of writing.

Why Performance Matters in .NET

Before diving into specific techniques, it's worth understanding why reducing allocations is one of the biggest levers for .NET performance in GC-sensitive workloads like Lambda. For this post, we're looking at .NET 10.

The .NET garbage collector organises objects into three generations. Gen 0 collects short-lived objects which is cheap and fast. Gen 1 is a buffer zone. Gen 2 collects long-lived objects and is the most expensive. Even with background GC (the default since .NET 4.5), which runs the sweep phase concurrently, Gen 2 still requires stop-the-world pauses during marking and compaction. Every allocation adds pressure. When Gen 0 fills up, a collection runs. Objects that survive get promoted to Gen 1, then Gen 2. The more objects you allocate, the more frequently collections fire, and the more likely you are to trigger Gen 2 pauses.
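To see that pressure directly, here's a small standalone sketch (not Goa code) that counts the bytes allocated and the Gen 0 collections triggered by a tight loop of throwaway strings:

```csharp
using System;

// Every i.ToString() creates a short-lived string; enough of them
// fill Gen 0 and force collections.
var allocatedBefore = GC.GetAllocatedBytesForCurrentThread();
var gen0Before = GC.CollectionCount(0);

var totalLength = 0L;
for (var i = 0; i < 1_000_000; i++)
{
    totalLength += i.ToString().Length; // one heap allocation per iteration
}

Console.WriteLine($"Allocated: {GC.GetAllocatedBytesForCurrentThread() - allocatedBefore:N0} bytes");
Console.WriteLine($"Gen 0 collections: {GC.CollectionCount(0) - gen0Before}");
```

Run this and you'll see megabytes of allocation and a stack of Gen 0 collections from one innocuous-looking loop; that's the work the techniques below are designed to avoid.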

When you're working with Lambda at real scale (tens of millions of operations per day), this matters even more.

Almost every technique in this post comes back to the same thing: allocate less, and the GC has less work to do, so we have faster, more predictable performance at a lower price point.

Request Signing

Every AWS API call requires Signature Version 4 (SigV4 hereafter) signing. This involves hashing the request, building a canonical representation, and computing an HMAC chain. Standard implementations create dozens of intermediate strings, byte arrays, and hash buffers per request. When you're making hundreds of API calls per Lambda invocation, this adds up fast.

When I profiled this, I noticed the AWS SDK was allocating ~11 KB of memory per signature. I managed to get this down to ~600 bytes, with the nice side effect of better performance as well:

Method                     Categories  Mean        Error      StdDev     Ratio  RatioSD  Gen0    Allocated  Alloc Ratio
-------------------------  ----------  ----------  ---------  ---------  -----  -------  ------  ---------  -----------
Goa_SignRequest_Small      Small         3.131 us  0.0531 us  0.0497 us   0.38     0.01  0.0076      568 B         0.05
AwsSdk_SignRequest_Small   Small         8.309 us  0.1661 us  0.2274 us   1.00     0.04  0.2136    11104 B         1.00
Goa_SignRequest_Medium     Medium        6.986 us  0.0893 us  0.0836 us   0.57     0.01  0.0076      568 B         0.05
AwsSdk_SignRequest_Medium  Medium       12.358 us  0.2428 us  0.2385 us   1.00     0.03  0.2136    11104 B         1.00
Goa_SignRequest_Large      Large        45.101 us  0.5048 us  0.4722 us   0.90     0.01  -           568 B         0.05
AwsSdk_SignRequest_Large   Large        50.185 us  0.6561 us  0.6137 us   1.00     0.02  0.1221    11104 B         1.00
Goa_SignRequest_XLarge     XLarge      446.664 us  6.0213 us  5.6323 us   0.99     0.02  -           570 B         0.05
AwsSdk_SignRequest_XLarge  XLarge      451.189 us  6.3464 us  5.6260 us   1.00     0.02  -         11106 B         1.00

Benchmark Link. The benchmark gets as close to an apples-to-apples comparison as possible, though each SDK uses its own request abstractions; it's less about latency and more about the allocations generated. Let's take a look at some of the optimisations applied to get these results.

For those interested, here's the full RequestSigner Source Code.

stackalloc for Fixed-Size Buffers

Any time we know something is fixed-length, we have an opportunity to use a fixed-size buffer and potentially avoid the heap entirely. There are two places where SigV4 has fixed sizing:

SHA-256 Hashing

SHA256.HashData has two overloads worth knowing about:

// Overload 1: returns a new byte[] which allocates 32 bytes on the heap every call
byte[] hash = SHA256.HashData(payload);
// Overload 2: writes into a caller-provided Span<byte> which has no allocations
Span<byte> hashBuffer = stackalloc byte[32];
SHA256.HashData(payload, hashBuffer);

The first overload is convenient but allocates a byte[32] on the heap every time. Since SHA-256 output is always exactly 32 bytes, we can use stackalloc to place that buffer on the stack and pass it to the span-based overload. The buffer lives on the stack frame and disappears when the method returns with zero GC involvement. You could also rent from ArrayPool<byte> here, but for 32 bytes stackalloc is simpler and faster since there's no rent/return overhead.

Date Formatting

The same principle applies to date formatting. AWS SigV4 requires two date formats:

// Before: ToString allocates a new string on the heap
string shortDate = dateTime.ToString("yyyyMMdd");
string longDate = dateTime.ToString("yyyyMMdd'T'HHmmss'Z'");

Since both are fixed-length, we can use the same strategy as we did for SHA-256 hashing:

// After: write directly into a stack-allocated span
Span<char> shortDate = stackalloc char[8];
Span<char> longDate = stackalloc char[16];
dateTime.TryFormat(shortDate, out _, "yyyyMMdd".AsSpan());
dateTime.TryFormat(longDate, out _, "yyyyMMdd'T'HHmmss'Z'".AsSpan());

DateTime.TryFormat writes directly into the provided span, avoiding the string allocation that ToString would create. Both buffers are small enough not to cause any issues with the stack, so this is a safe optimisation for us to make.

Span<T> for Zero-Copy String Building

SigV4 requires building a canonical request string in a specific format:

HTTPMethod\n
CanonicalURI\n
CanonicalQueryString\n
CanonicalHeaders\n
SignedHeaders\n
HashedPayload

Each component needs its own encoding rules. The URI must be percent-encoded (spaces become %20, unreserved characters pass through), headers must be lowercased and sorted, query parameters must be sorted by key. A naive implementation would build each component as a separate string, then concatenate them together, allocating a new string at every step.

Goa avoids this with a two-pass, measure-then-write pattern. The first pass calculates the exact character count each component will occupy without writing anything. For example, MeasureCanonicalUri walks the URI and counts: unreserved chars and / contribute 1 character each, everything else contributes 3 (%XX). Once we know the total size, we rent a single buffer from ArrayPool and the second pass writes each component directly into that buffer at the correct offset:

// Pass 1: measure each component to calculate total size
var canonUriLen = MeasureCanonicalUri(pathStr);
 
// Precise size calculation for canonical request to minimize buffer waste
var canonicalEstimate =
    methodLen + 1 +
    canonUriLen + 1 +
    (canonQueryLen > 0 ? canonQueryLen : 0) + 1 +
    EstimateHeadersCharCount(headerRefs, totalHeaders, ...) + 1 +
    EstimateSignedHeadersLen(headerRefs, totalHeaders) + 1 +
    64; // payload hash
 
// Rent a single buffer sized to hold the entire canonical request
var rented = ArrayPool<char>.Shared.Rent(canonicalEstimate);
var canonical = rented.AsSpan();
var pos = 0;
 
// Pass 2: write each component directly into the span
var m = request.Method.Method ?? "GET";
m.AsSpan().CopyTo(canonical.Slice(pos));
pos += m.Length;
canonical[pos++] = '\n';
 
pos += WriteCanonicalUri(canonical.Slice(pos), pathStr);
canonical[pos++] = '\n';
 
// ... continue for query string, headers, signed headers, payload hash

This same measure-then-write pattern is repeated for query strings, headers, and every other component in the signing process. No intermediate string objects are ever created. The entire canonical request is built in a single rented buffer, and async work like credential retrieval happens upfront so the hot path is entirely synchronous and span-based.
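Here's a simplified, self-contained sketch of the same measure-then-write idea, with hypothetical helper names (Goa's real MeasureCanonicalUri/WriteCanonicalUri handle more encoding cases). Pass 1 counts characters, pass 2 writes into one rented buffer, and no intermediate strings are created:

```csharp
using System;
using System.Buffers;

var path = "/my path/item";

// Pass 1: measure, then rent exactly one buffer of the right size
var rented = ArrayPool<char>.Shared.Rent(MeasureCanonicalUri(path));
try
{
    // Pass 2: write directly into the rented buffer
    var written = WriteCanonicalUri(rented, path);
    Console.WriteLine(new string(rented, 0, written)); // /my%20path/item
}
finally
{
    ArrayPool<char>.Shared.Return(rented);
}

// Unreserved characters (and '/') pass through; everything else is %XX.
// This sketch assumes ASCII input.
static bool IsUnreserved(char c) =>
    char.IsAsciiLetterOrDigit(c) || c is '-' or '.' or '_' or '~' or '/';

static int MeasureCanonicalUri(ReadOnlySpan<char> path)
{
    var len = 0;
    foreach (var c in path)
        len += IsUnreserved(c) ? 1 : 3; // %XX contributes 3 chars
    return len;
}

static int WriteCanonicalUri(Span<char> dest, ReadOnlySpan<char> path)
{
    var pos = 0;
    foreach (var c in path)
    {
        if (IsUnreserved(c))
        {
            dest[pos++] = c;
        }
        else
        {
            dest[pos++] = '%';
            dest[pos++] = "0123456789ABCDEF"[c >> 4];
            dest[pos++] = "0123456789ABCDEF"[c & 0xF];
        }
    }
    return pos;
}
```

The measuring pass costs one extra walk over the input, but it buys an exactly-sized buffer and zero resizes or copies on the write pass.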

ArrayPool<T> for Dynamic Buffers

The previous two sections used stackalloc for fixed-size buffers and Span<T> with ArrayPool for building the canonical request. It's worth explaining when to reach for each.

stackalloc allocates on the stack frame and is effectively free, but the size needs to be small and bounded (a few hundred bytes at most, otherwise you risk a stack overflow exception). For SHA-256's 32 bytes or a date's 16 characters, it's perfect.

When the size depends on runtime input (like the length of a URI or the number of query parameters), stackalloc isn't safe. That's where ArrayPool<T>.Shared comes in. It maintains a pool of reusable arrays: you rent one, use it, and return it. This spares the GC work because the underlying arrays are recycled:

var queryBuffer = ArrayPool<char>.Shared.Rent(estimatedQuerySize);
try
{
    var queryWritten = WriteCanonicalQuery(queryBuffer, queryParts, encodedBuffer);
    // Use queryBuffer[..queryWritten] for the canonical request
}
finally
{
    ArrayPool<char>.Shared.Return(queryBuffer);
}

The Rent/Return pattern does require discipline. If you forget to return, the buffer becomes garbage and you lose the pooling benefit. Wrapping the usage in a try/finally ensures the buffer is always returned, even if an exception is thrown.

Custom Insertion Sort for Small Collections

AWS requests typically have 5-10 headers. The canonical request requires them sorted by name. Using LINQ's OrderBy() would allocate an IComparer, an iterator, and intermediate arrays. Instead, Goa uses a simple insertion sort on a Span<HeaderRef> where HeaderRef is a struct:

// Headers are represented as value types — no heap allocation
private readonly struct HeaderRef
{
    public readonly string Name;
    public readonly HeaderKind Kind;
    public readonly int Index;
}
 
// For <=16 headers, insertion sort avoids all the overhead of
// a general-purpose sort with IComparer allocation
private static void InsertionSortHeaders(Span<HeaderRef> span)
{
    for (var i = 1; i < span.Length; i++)
    {
        var key = span[i];
        var j = i - 1;
        while (j >= 0 && StringComparer.OrdinalIgnoreCase.Compare(
            span[j].Name, key.Name) > 0)
        {
            span[j + 1] = span[j];
            j--;
        }
        span[j + 1] = key;
    }
}

Span<T>.Sort() (available since .NET 5) is the zero-alloc BCL alternative, but it still allocates an IComparer internally for custom comparisons. For 16 or fewer headers, a hand-written insertion sort with direct StringComparer.OrdinalIgnoreCase.Compare calls avoids that overhead entirely and is simpler to reason about. For larger collections, the code falls back to Array.Sort with a cached comparer.

Readonly Struct for SHA-256 Hashes

During SigV4 signing, SHA-256 hashes get passed around between methods: hashing the payload, hashing the canonical request, building the string to sign. This is usually represented as a byte[]:

// A byte[] is a reference type, so every hash is a heap allocation
byte[] hash = SHA256.HashData(payload);

The problem is that byte[] is a reference type. Every hash you compute allocates 32 bytes on the heap, plus another ~24 bytes of array overhead (object header, method table pointer, and length). A single SigV4 signing operation computes multiple SHA-256 hashes, so each request creates several short-lived arrays that the garbage collector has to track and clean up.

The approach taken here was found by Claude. Since a SHA-256 digest is always exactly 32 bytes, and 32 bytes is exactly four ulong values (4 x 8 bytes), we could store the hash as a struct with four ulong fields instead of a byte[]. This means it lives on the stack rather than the heap, reducing GC pressure:

[StructLayout(LayoutKind.Sequential)]
private readonly struct Sha256Hash
{
    private readonly ulong W0, W1, W2, W3;
 
    // Zero-copy access to the underlying bytes
    public ReadOnlySpan<byte> AsBytes
        => MemoryMarshal.AsBytes(
            MemoryMarshal.CreateReadOnlySpan(ref Unsafe.AsRef(in W0), 4)
        );
}

StructLayout(LayoutKind.Sequential) guarantees the fields are contiguous in memory with no padding. Without this, the runtime could reorder or pad the fields, meaning the 32 bytes wouldn't form a single continuous block. With it, W0 through W3 sit back-to-back in memory, so we can safely treat them as a flat 32-byte buffer. That's what the AsBytes property does: MemoryMarshal reinterprets those four ulong fields directly as a ReadOnlySpan<byte> with zero copying, giving us byte-level access for operations like hex encoding without allocating a new byte[]. The Unsafe.AsRef(in W0) call converts the readonly field reference into the ref that CreateReadOnlySpan requires. The resulting span is read-only, so there's no mutation risk.

Marking it as a readonly struct also avoids defensive copies. When you call a member on a normal struct that was passed by in reference, the compiler makes a hidden copy to guarantee the original can't be mutated. With readonly struct, the compiler knows no member can mutate state, so the copy is skipped.
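To make the round trip concrete, here's a self-contained sketch using my own type name (Digest256) and a FromBytes factory that the excerpt above doesn't show; Goa's actual Sha256Hash may differ in the details:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

// Hash into a stack buffer, reinterpret as the struct, then view it as
// bytes again — no heap allocation at any step.
Span<byte> buffer = stackalloc byte[32];
SHA256.HashData("hello"u8, buffer);

var digest = Digest256.FromBytes(buffer);
Console.WriteLine(digest.AsBytes.SequenceEqual(buffer)); // True

[StructLayout(LayoutKind.Sequential)]
readonly struct Digest256
{
    private readonly ulong W0, W1, W2, W3;

    // Reinterpret 32 bytes as the four-ulong struct (copies 32 bytes of
    // data, allocates nothing)
    public static Digest256 FromBytes(ReadOnlySpan<byte> bytes) =>
        MemoryMarshal.Read<Digest256>(bytes);

    // Zero-copy view of the struct's storage, as in the excerpt above
    public ReadOnlySpan<byte> AsBytes =>
        MemoryMarshal.AsBytes(
            MemoryMarshal.CreateReadOnlySpan(ref Unsafe.AsRef(in W0), 4));
}
```

Because the struct is a raw byte reinterpretation in both directions, endianness never matters: the bytes you put in are the bytes you get back.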

DynamoDB Client

DynamoDB was my primary focus when starting Goa because it's where most of my application code lives. The official AWS SDK V4's DynamoDBContext (the Object Persistence Model) still uses reflection for property discovery, attribute reading, and object construction. There's an open discussion about a source generator alternative, but it doesn't exist yet. V4 moved from LitJson to System.Text.Json for wire serialisation, but the DynamoDB mapping layer itself remains reflection-based. This is flexible but expensive, and partially incompatible with native AOT (nested types can be silently trimmed). If you're interested in serialization performance more broadly, I've covered .NET serialization benchmarks in a separate post.

Source Generator for DynamoDB Mapping

Since it didn't exist in the AWS SDK, I had a "how hard can this be" thought and wrote a source generator so that I could completely eliminate reflection. Goa uses a Roslyn incremental source generator that generates ToDynamoRecord() and FromDynamoRecord() methods at compile time. You decorate your types like this:

[DynamoModel(PK = "USER#<Id>", SK = "PROFILE#<Id>")]
public record UserProfile(string Id, string Name, string Email);

And the generator emits:

// Auto-generated at compile time — no reflection, no runtime cost
public static class DynamoMapper
{
    // Each decorated type gets a nested class
    public static class UserProfile
    {
        public static DynamoRecord ToDynamoRecord(UserProfile model)
        {
            var record = new DynamoRecord();
            record["pk"] = AttributeValue.String($"USER#{model.Id}");
            record["sk"] = AttributeValue.String($"PROFILE#{model.Id}");
            record["Name"] = AttributeValue.String(model.Name);
            record["Email"] = AttributeValue.String(model.Email);
            return record;
        }
 
        public static UserProfile FromDynamoRecord(DynamoRecord record, string? parentPkValue = null, string? parentSkValue = null)
        {
            var pkValue = record.TryGetNullableString("pk", out var pk) ? pk : parentPkValue ?? string.Empty;
            var skValue = record.TryGetNullableString("sk", out var sk) ? sk : parentSkValue ?? string.Empty;
 
            return new UserProfile(
                string.IsNullOrEmpty(pkValue)
                    ? MissingAttributeException.Throw<string>("Id", pkValue, skValue)
                    : pkValue,
                record.GetString("Name"),
                record.GetString("Email")
            );
        }
    }
}

The source generator handles inheritance hierarchies with type discriminators, Global Secondary Index key patterns with template placeholders, nullable property handling with conditional DynamoDB attribute assignment, and complex nested types. You can adjust the values of PK/SK attribute names via the attributes as well.

JSON Source Generation

All DynamoDB wire communication uses System.Text.Json source generation. Instead of using reflection at runtime to discover how to serialize each type, the source generator emits optimized readers and writers at compile time. This is both faster (no reflection overhead) and required for native AOT compatibility. The DynamoJsonContext registers ~50 types for compile-time serialisation, covering every request, response, and model type:

[JsonSourceGenerationOptions(
    DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
    PropertyNamingPolicy = JsonKnownNamingPolicy.Unspecified,
    GenerationMode = JsonSourceGenerationMode.Default,
    UseStringEnumConverter = true
)]
[JsonSerializable(typeof(AttributeValue))]
[JsonSerializable(typeof(DynamoRecord))]
[JsonSerializable(typeof(GetItemRequest))]
[JsonSerializable(typeof(GetItemResponse))]
[JsonSerializable(typeof(PutItemRequest))]
[JsonSerializable(typeof(PutItemResponse))]
[JsonSerializable(typeof(QueryRequest))]
[JsonSerializable(typeof(QueryResponse))]
// ... ~50 types total for DynamoDb
internal partial class DynamoJsonContext : JsonSerializerContext;

Each service client in Goa has its own JsonSerializerContext rather than a single shared one. This keeps the generated code focused and avoids the compile-time cost of a monolithic context. With 22 contexts across the project and ~50 types in the DynamoDB context alone, this is a non-trivial amount of generated code, but it means every serialisation path is AOT-safe by construction. The .NET team's performance blog posts go into detail on the improvements they've made to the source-generated serializer.
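As a minimal, self-contained illustration of what calling a generated context looks like (the type names here are mine, not Goa's): passing the generated JsonTypeInfo means the serialiser never touches reflection-based metadata discovery.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Serialise via the compile-time generated metadata: the context exposes
// one JsonTypeInfo property per [JsonSerializable] type.
var json = JsonSerializer.Serialize(
    new Ping("hello"),
    DemoJsonContext.Default.Ping);

Console.WriteLine(json); // {"Message":"hello"}

public record Ping(string Message);

[JsonSerializable(typeof(Ping))]
internal partial class DemoJsonContext : JsonSerializerContext;
```

The same call shape is what the trimmer and AOT compiler can fully analyse, which is why every Goa serialisation path routes through a context like this.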

AttributeValue as a Readonly Struct

This was another optimisation that Claude suggested, and I'm glad it did because the allocation reduction was significant. In DynamoDB, every cell in every row is an AttributeValue. When you query 100 items with 10 attributes each, that's 1,000 AttributeValue instances. As a class, each one is a separate heap allocation with object header overhead. As a readonly struct, they're inlined directly into the parent collection, so there's no per-value heap allocation at all.

In PR #70, AttributeValue was converted from a class to a readonly struct with factory methods:

public readonly struct AttributeValue
{
    // Factory methods instead of constructors - clear intent, no allocation
    public static AttributeValue String(string value)
    {
        ArgumentNullException.ThrowIfNull(value);
        return new(value, AttributeType.String);
    }
 
    public static AttributeValue Number(string value)
    {
        ArgumentNullException.ThrowIfNull(value);
        return new(value, AttributeType.Number);
    }
 
    public static AttributeValue Bool(bool value) => new(value);
}

Factory methods are used instead of public constructors so that each creation site reads clearly (AttributeValue.String("hello") vs new AttributeValue("hello", AttributeType.String)). This is purely an API design choice, not a performance one.

The benchmark results from the PR tell the story. Before the change, a 100-item query allocated 201.68 KB in Goa vs 454.51 KB in the AWS SDK. After the change, that dropped to 153.85 KB, a ~24% reduction in allocations from a single type change:

Method        Allocated              Alloc Ratio
------------  ---------------------  -----------
Goa (before)  201.68 KB              0.44
Goa (after)   153.85 KB              0.38
AWS SDK       454.51 KB / 407.56 KB  1.00

I wouldn't reach for this as a first step though. Normally, you'd write it the simple way first, profile to find what actually matters, try pooling or caching, and only then consider structural changes like switching from a class to a struct. In this case, the profiling showed AttributeValue was one of the highest-volume allocations, which made it a clear candidate.

HTTP Pipeline — Pooling and Async Optimisation

The HTTP pipeline handles every request and response. Small inefficiencies here compound across thousands of calls.

PooledBufferWriter

PR #67 introduced PooledBufferWriter — an IBufferWriter<byte> backed by ArrayPool. It rents from the shared pool on construction and returns on disposal:

public sealed class PooledBufferWriter : IBufferWriter<byte>, IDisposable
{
    private byte[] _buffer;
    private int _written;

    public PooledBufferWriter(int initialCapacity)
    {
        _buffer = ArrayPool<byte>.Shared.Rent(initialCapacity);
    }

    public ReadOnlySpan<byte> WrittenSpan => _buffer.AsSpan(0, _written);
    public ReadOnlyMemory<byte> WrittenMemory => _buffer.AsMemory(0, _written);

    public void Advance(int count) => _written += count;

    public Span<byte> GetSpan(int sizeHint = 256)
    {
        EnsureCapacity(sizeHint);
        return _buffer.AsSpan(_written);
    }

    public Memory<byte> GetMemory(int sizeHint = 256)
    {
        EnsureCapacity(sizeHint);
        return _buffer.AsMemory(_written);
    }

    // Growth logic simplified for this excerpt: rent a larger array, copy
    // the written prefix, and return the old buffer to the pool
    private void EnsureCapacity(int sizeHint)
    {
        if (_buffer.Length - _written >= Math.Max(sizeHint, 1))
            return;

        var grown = ArrayPool<byte>.Shared.Rent(Math.Max(_buffer.Length * 2, _written + sizeHint));
        _buffer.AsSpan(0, _written).CopyTo(grown);
        ArrayPool<byte>.Shared.Return(_buffer, clearArray: true);
        _buffer = grown;
    }

    public void Dispose()
    {
        ArrayPool<byte>.Shared.Return(_buffer, clearArray: true);
    }
}

This pattern is perfect for building HTTP request bodies. The buffer grows as needed by renting larger arrays from the pool, and clearArray: true on disposal prevents data leakage. If you're interested in the IDisposable pattern more generally, I've covered using IDisposable correctly in another post.

The second half of the zero-copy pipeline is ReadOnlyMemoryContent. Once the request body is serialized into the pooled buffer, we wrap it directly as HTTP content without copying:

// Zero-copy: ReadOnlyMemoryContent borrows the buffer writer's memory
var memory = bufferWriter.WrittenMemory;
if (memory.Length > 0 && method != HttpMethod.Get)
{
    var content = new ReadOnlyMemoryContent(memory);
    content.Headers.ContentType = contentType;
    requestMessage.Content = content;
}

ReadOnlyMemoryContent references the pooled buffer's memory slice directly, with no byte[] copy. The request body bytes that JsonSerializer wrote into the pooled buffer are the same bytes that HttpClient sends over the wire. The ReadOnlyMemoryContent object itself is still a heap allocation (it's a class), but the payload data, which is the large part, is never duplicated. The only requirement is that the buffer writer outlives the request, which the using scope in the calling method guarantees.

BCL SIMD Alternatives

PR #65 replaced hand-written scalar hex conversion and lowercase operations with BCL methods that leverage SIMD under the hood:

// Before: hand-written scalar hex conversion, one byte at a time
BytesToHex_Scalar(_hash, hexOut);
 
// After: BCL method that uses SIMD when available
Convert.TryToHexStringLower(_hash, hexOut, out _);

Similarly for lowercase conversion:

// Before: char-by-char loop
for (var i = 0; i < s.Length; i++)
    dest[i] = char.ToLowerInvariant(s[i]);
 
// After: SIMD-accelerated (processes 16+ chars at once on modern CPUs)
System.Text.Ascii.ToLower(headerName, dest, out var written);

The Goa benchmarks include a dedicated HexAndLowercaseBenchmarks suite that compares the scalar implementations against the BCL alternatives. Convert.TryToHexStringLower outperforms the hand-written bit manipulation for SHA-256 hex encoding, and System.Text.Ascii.ToLower beats character-by-character conversion for both short headers (Content-Type, 12 chars) and longer ones (X-Amz-Content-Sha256, 20 chars). The gap widens as string length increases because the BCL methods can process 16+ characters per instruction on modern CPUs via SIMD. The BCL team has done the hard work of writing platform-specific implementations; there's no reason to maintain your own.

ValueTask Synchronous Fast Paths

For operations that often complete synchronously like the earlier example of returning a pre-computed payload hash, Goa uses ValueTask to avoid the async state machine allocation:

private static ValueTask<Sha256Hash> ComputePayloadHashAsync(HttpRequestMessage request)
{
    // No content — hash the empty payload synchronously
    if (request.Content is null)
    {
        Span<byte> hash = stackalloc byte[32];
        SHA256.HashData(ReadOnlySpan<byte>.Empty, hash);
        return ValueTask.FromResult(Sha256Hash.FromBytes(hash));
    }
 
    // Pre-computed payload available — complete synchronously
    if (request.Options.TryGetValue(HttpOptions.Payload, out var payload)
        && payload.Length > 0)
    {
        return ValueTask.FromResult(ComputePayloadHashFromBytes(payload.Span));
    }
 
    // Only create the async state machine when we actually need to stream
    return ComputePayloadHashFromStreamAsync(request);
}

When either fast path hits, ValueTask.FromResult() returns without allocating a Task or creating an async state machine. The async machinery only kicks in for the rare case where content needs to be streamed from the request body.

Zero-Alloc Logging, Timing, and Pipeline Micro-Optimizations

PR #66 tackled allocations across the entire HTTP pipeline. The biggest win was replacing Dictionary<string, object> log scopes with custom readonly structs.

Struct-Based Log Scopes

Every call to Logger.BeginScope() in the HTTP pipeline was allocating a Dictionary<string, object> with 8 entries. That means a dictionary, its internal arrays, hash buckets, and boxed values on every single request:

// Before: Dictionary allocates on the heap every call
using var logContext = Logger.BeginScope(new Dictionary<string, object>
{
    ["Client"] = _clientType,
    ["Region"] = Configuration.Region,
    ["Service"] = Configuration.Service,
    ["SigningService"] = Configuration.SigningService,
    ["Target"] = target,
    ["ApiVersion"] = Configuration.ApiVersion,
    ["Method"] = request.Method.ToString(),
    ["Uri"] = request.RequestUri?.ToString() ?? "Unknown"
});

The replacement is a pair of readonly struct types, LogScope4 and LogScope8, that implement IReadOnlyList<KeyValuePair<string, object>>. This is the specific interface that Microsoft.Extensions.Logging looks for when extracting structured properties from a scope:

// After: readonly struct — lives on the stack, zero heap allocation
using var logContext = Logger.BeginScope(new LogScope8(
    new("Client", _clientType),
    new("Region", Configuration.Region),
    new("Service", Configuration.Service),
    new("SigningService", Configuration.SigningService),
    new("Target", target),
    new("ApiVersion", Configuration.ApiVersion),
    new("Method", request.Method.Method),
    new("Uri", request.RequestUri?.AbsoluteUri ?? "Unknown")
));

The struct stores all 8 key-value pairs as individual fields (_0 through _7) with a switch-expression indexer. It also has its own nested struct Enumerator, so a duck-typed foreach avoids boxing the enumerator too. LogScope4 uses a _count field to support 2, 3, or 4 entry constructors for smaller scopes elsewhere in the codebase.
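The real LogScope8 isn't reproduced here, but the shape is easy to sketch. This is my own simplified two-entry reconstruction of the idea, not Goa's code:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A fixed set of key-value fields behind the one interface that
// Microsoft.Extensions.Logging probes for structured scope properties.
public readonly struct LogScope2 : IReadOnlyList<KeyValuePair<string, object>>
{
    private readonly KeyValuePair<string, object> _0;
    private readonly KeyValuePair<string, object> _1;

    public LogScope2(KeyValuePair<string, object> p0, KeyValuePair<string, object> p1)
    {
        _0 = p0;
        _1 = p1;
    }

    public int Count => 2;

    public KeyValuePair<string, object> this[int index] => index switch
    {
        0 => _0,
        1 => _1,
        _ => throw new ArgumentOutOfRangeException(nameof(index))
    };

    // Duck-typed struct enumerator: foreach binds to this overload and
    // never boxes
    public Enumerator GetEnumerator() => new(this);

    public struct Enumerator
    {
        private readonly LogScope2 _scope;
        private int _index;

        internal Enumerator(LogScope2 scope)
        {
            _scope = scope;
            _index = -1;
        }

        public KeyValuePair<string, object> Current => _scope[_index];
        public bool MoveNext() => ++_index < _scope.Count;
    }

    // Interface paths allocate only if a caller actually goes through them
    IEnumerator<KeyValuePair<string, object>> IEnumerable<KeyValuePair<string, object>>.GetEnumerator()
    {
        for (var i = 0; i < Count; i++) yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() =>
        ((IEnumerable<KeyValuePair<string, object>>)this).GetEnumerator();
}
```

Extending this to eight fields with a switch indexer, as LogScope8 does, is mechanical; the values are still boxed as object, but the dictionary, its internal arrays, and the hash buckets are all gone.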

Stopwatch Allocation

Stopwatch.StartNew() allocates a Stopwatch object on the heap with internal state fields for _elapsed, _startTimeStamp, and _isRunning. Replacing it with the static GetTimestamp() / GetElapsedTime() pair (available since .NET 7) gives the same high-resolution timing from a single long:

// Before: allocates a Stopwatch object per request
var sw = Stopwatch.StartNew();
var response = await client.SendAsync(request, cancellationToken);
Logger.RequestComplete(Configuration.LogLevel, sw.Elapsed.TotalMilliseconds);
 
// After: just a long on the stack
var start = Stopwatch.GetTimestamp();
var response = await client.SendAsync(request, cancellationToken);
Logger.RequestComplete(Configuration.LogLevel, Stopwatch.GetElapsedTime(start).TotalMilliseconds);

Cached Base URI

Every call to CreateRequestMessage was building the base URL string via interpolation and parsing it into a new Uri(...). The base URL is identical across all calls for a given client instance, so it only needs to be parsed once:

// Field on the client class
private Uri? _cachedBaseUri;
 
// In CreateRequestMessage:
var baseUri = _cachedBaseUri ??= new Uri(
    Configuration.ServiceUrl ??
    $"https://{Configuration.Service.ToLower()}.{Configuration.Region}.amazonaws.com/");
 
// Subsequent requests use the two-arg constructor — scheme, host, port
// are already parsed in baseUri
Uri finalUri = requestUri == "/"
    ? baseUri
    : new Uri(baseUri, requestUri.TrimStart('/'));

Death by a Thousand Allocations

The same PR also cleaned up several smaller allocation sources across the pipeline.

AOT Compatibility

Native AOT compiles your .NET application directly to machine code ahead of time. The result is near-instant startup (critical for Lambda cold starts) but with a constraint: no runtime reflection. Every reflection call is a potential runtime failure.

Solution-Wide AOT Configuration

The Directory.Build.props at the solution root enables AOT compatibility for all library projects:

<PropertyGroup>
    <IsAotCompatible>true</IsAotCompatible>
</PropertyGroup>

This turns on the AOT compatibility analyser across the entire solution. Any code that uses reflection gets a build warning.

DynamicallyAccessedMembers for DI

Generic methods that resolve types through dependency injection need the [DynamicallyAccessedMembers] attribute to tell the trimmer which members to preserve:

public static ILambdaBuilder WithInitializationTask<
    [DynamicallyAccessedMembers(
        DynamicallyAccessedMemberTypes.PublicConstructors
    )] T
>(this ILambdaBuilder builder)
    where T : class, ILambdaInitializationTask
    => builder.WithServices(services =>
        services.TryAddEnumerable(
            ServiceDescriptor.Singleton<ILambdaInitializationTask, T>()
        ));

Without this attribute, the trimmer would strip the constructor of T since it's only accessed via Activator.CreateInstance at runtime. The attribute preserves exactly what's needed. If you're interested in how C# duck typing and hidden features work with the compiler, this is a similar concept where the compiler needs hints about runtime behaviour.

JSON Serialiser Contexts Everywhere

Across the entire Goa project, there are 21+ JsonSerializerContext implementations: one for each service client, Lambda event type, and internal model. Every JSON serialisation path goes through a source-generated context.

Templates Ship with PublishAot

All Goa project templates include <PublishAot>true</PublishAot> by default. When you run dotnet new goa.apigw, you get a project that's AOT-ready out of the box.
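In a template's project file that amounts to a single property. Here's a sketch of the relevant section (the surrounding properties are illustrative, not the exact template contents):

```xml
<PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net10.0</TargetFramework>
    <!-- Compile to native code at publish time: near-instant Lambda cold starts -->
    <PublishAot>true</PublishAot>
</PropertyGroup>
```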

Getting Started

If you want to try all of this out yourself and help give feedback on what I've done so far, you can install the templates via NuGet:

dotnet new install Goa.Templates
 
Template Name                      Short Name          Language  Tags
---------------------------------  ------------------  --------  ----------------------------------
Lambda - API Gateway Authorizer    goa.authorizer      [C#]      AWS/Lambda/Serverless/Goa/Security
Lambda - API Gateway Function      goa.apigw           [C#]      AWS/Lambda/Serverless/Goa
Lambda - CloudWatch Logs Function  goa.cloudwatchlogs  [C#]      AWS/Lambda/Serverless/Goa
Lambda - DynamoDB Function         goa.dynamodb        [C#]      AWS/Lambda/Serverless/Goa
Lambda - EventBridge Function      goa.eventbridge     [C#]      AWS/Lambda/Serverless/Goa
Lambda - Generic Function          goa.lambda          [C#]      AWS/Lambda/Serverless/Goa
Lambda - Kinesis Function          goa.kinesis         [C#]      AWS/Lambda/Serverless/Goa
Lambda - S3 Function               goa.s3              [C#]      AWS/Lambda/Serverless/Goa
Lambda - SQS Function              goa.sqs             [C#]      AWS/Lambda/Serverless/Goa

Then you can create a new project from a template, e.g. an API Gateway Lambda function:

dotnet new goa.apigw -n "MyFirstGoaProject"
cd MyFirstGoaProject

The generated project includes a minimal Lambda function with dependency injection, JSON source generation, and native AOT configured:

await Host.CreateDefaultBuilder()
    .UseLambdaLifecycle()
    .ForAspNetCore(app =>
    {
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            endpoints.MapGet("/ping", static async context =>
            {
                context.Response.ContentType = "application/json";
                await JsonSerializer.SerializeAsync(
                    context.Response.Body, new Pong("PONG!"),
                    HttpSerializerContext.Default.Pong);
            });
        });
    }, apiGatewayType: ApiGatewayType.HttpV2)
    .WithServices(services =>
    {
        services.ConfigureHttpJsonOptions(options =>
        {
            options.SerializerOptions.TypeInfoResolver =
                HttpSerializerContext.Default;
        });
    })
    .RunAsync();
 
public record Pong(string Message);
 
[JsonSourceGenerationOptions(
    WriteIndented = false,
    PropertyNamingPolicy = JsonKnownNamingPolicy.CamelCase,
    DictionaryKeyPolicy = JsonKnownNamingPolicy.CamelCase,
    UseStringEnumConverter = true
)]
[JsonSerializable(typeof(Pong))]
public partial class HttpSerializerContext : JsonSerializerContext
{
}

Check the Goa.Templates directory for all options.

Wrapping Up

That's 19 techniques across four component areas:

  1. Request Signing: stackalloc for fixed buffers, Span<T> measure-then-write, ArrayPool<T> for dynamic buffers, insertion sort for small collections, readonly struct Sha256Hash
  2. DynamoDB Client: Roslyn source generator, JSON source generation, readonly struct AttributeValue
  3. HTTP Pipeline: PooledBufferWriter with ReadOnlyMemoryContent, BCL SIMD alternatives, ValueTask fast paths, struct log scopes, Stopwatch.GetTimestamp, cached base URI, micro-allocation cleanup
  4. AOT Compatibility: IsAotCompatible, DynamicallyAccessedMembers, JsonSerializerContext everywhere, PublishAot templates

The underlying mindset is straightforward: measure first, optimise what matters, and lean on the BCL wherever possible. The .NET runtime team has put extraordinary effort into making the standard library fast; use it.

Goa is in pre-release so APIs may change. If you hit issues or have ideas, open an issue or contribute directly. If any of this is useful to you, give Goa a star, try it in a project, or contribute. Every bit helps.
