This post shows a pragmatic pattern for deploying TinyLlama/TinyLlama-1.1B-Chat-v1.0 behind a SageMaker real-time endpoint, then fronting it with API Gateway + Lambda and an API key so you can call it like a normal HTTPS API.

Code: slm_sagemaker

What you’ll end up with:

  • A /invoke endpoint you can curl
  • Usage-plan throttling + API key authentication
  • A simple request/response shim in Lambda
  • Clear diagrams so you can reason about the pieces

⚠️ COST WARNING: This deployment uses a real-time ml.g5.xlarge GPU instance that runs 24/7. The endpoint keeps billing even when it isn't processing requests, and GPU pricing varies by region, so destroy the stack when you're not using it: make destroy PROFILE=ml-sage REGION=eu-west-2

Architecture

System Components

The following diagram illustrates the AWS components that make up this real-time SLM solution:

architecture-beta
    group api(cloud)[API Layer]
    group compute(cloud)[Compute Layer]
    group ml(cloud)[ML Infrastructure]

    service gateway(internet)[API Gateway] in api
    service apikey(disk)[API Key] in api
    service lambda(server)[Lambda Function] in compute
    service sagemaker(server)[SageMaker Endpoint] in ml
    service model(database)[TinyLlama Model] in ml

    gateway:R --> L:lambda
    apikey:B --> T:gateway
    lambda:R --> L:sagemaker
    sagemaker:B --> T:model

Component Overview:

  • API Gateway: REST API with /invoke endpoint for client requests
  • API Key: Authenticates and rate-limits API requests (50 req/sec, 10k/day; see the CDK sketch after this list)
  • Lambda Function: Processes requests and invokes the SageMaker endpoint
  • SageMaker Endpoint: Real-time inference endpoint on ml.g5.xlarge (NVIDIA A10G, 24GB GPU memory)
  • TinyLlama-1.1B Model: HuggingFace TGI container serving the language model
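
On the API Gateway side, the throttling and key wiring boils down to a usage plan. Here's a minimal sketch of how that might look in CDK (Python), assuming CDK v2; the construct IDs, the burst limit, and the `api` handle are illustrative, and the repo's stack may organize this differently:

from aws_cdk import aws_apigateway as apigw

# Assumes `api` is an existing apigw.RestApi whose /invoke method
# was created with api_key_required=True.
plan = api.add_usage_plan(
    "SlmUsagePlan",
    throttle=apigw.ThrottleSettings(rate_limit=50, burst_limit=100),  # 50 req/sec; burst is a guess
    quota=apigw.QuotaSettings(limit=10_000, period=apigw.Period.DAY),  # 10k requests/day
)
key = api.add_api_key("SlmApiKey")
plan.add_api_key(key)
plan.add_api_stage(stage=api.deployment_stage)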

Request Flow

The sequence diagram below shows how a request flows through the system from an API client to the SLM model and back:

sequenceDiagram
    actor Client
    participant API as API Gateway
    participant Lambda as Lambda Function
    participant SageMaker as SageMaker Endpoint
    participant Model as TinyLlama Model

    Client->>+API: POST /invoke<br/>(x-api-key header)
    Note over API: Validate API Key<br/>Check rate limits
    API->>+Lambda: Invoke with payload
    Note over Lambda: Extract prompt<br/>& parameters
    Lambda->>+SageMaker: invoke_endpoint()
    Note over SageMaker: Route to always-on<br/>GPU instance
    SageMaker->>+Model: Generate text
    Note over Model: Process prompt<br/>with TinyLlama-1.1B
    Model-->>-SageMaker: Generated response
    SageMaker-->>-Lambda: Response JSON
    Lambda-->>-API: Formatted response
    API-->>-Client: Return generated text

In plain English:

  • API Gateway checks your API key (and throttles if needed)
  • Lambda validates the request and calls the SageMaker runtime (a handler sketch follows this list)
  • SageMaker runs the model container on a GPU instance
  • You get a JSON response back
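
To make the Lambda shim concrete, here's a minimal handler sketch. The {"inputs": ..., "parameters": ...} payload is TGI's standard interface; the ENDPOINT_NAME environment variable and the exact response shaping are assumptions rather than the repo's literal code:

import json
import os

import boto3

# Assumed: the endpoint name is passed in via an environment variable.
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]
runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt")
    if not prompt:
        return {"statusCode": 400, "body": json.dumps({"error": "prompt is required"})}

    parameters = body.get("parameters", {})
    parameters.setdefault("max_new_tokens", 128)

    # TGI expects {"inputs": ..., "parameters": {...}}.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": parameters}),
    )
    result = json.loads(response["Body"].read())  # e.g. [{"generated_text": "..."}]

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "generated_text": result[0]["generated_text"],
            "prompt": prompt,
            "parameters": parameters,
        }),
    }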

The Model (TinyLlama-1.1B-Chat-v1.0)

TinyLlama/TinyLlama-1.1B-Chat-v1.0 is a compact instruction-tuned model with only 1.1B parameters. That small size is what makes the deployment cost-effective: the model fits comfortably on a single ml.g5.xlarge instance while remaining capable enough for basic summarization, Q&A, and conversational tasks, which also makes it a good fit for prototyping and latency-sensitive applications.

In this setup it’s served via the Hugging Face TGI (Text Generation Inference) container, which exposes a straightforward request/response interface and handles batching and tokenization internally.
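
To see what that amounts to without the CDK plumbing, here's a rough equivalent using the SageMaker Python SDK; the role ARN and TGI image version are placeholders, and the repo's CDK stack provisions the same kind of endpoint declaratively:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

model = HuggingFaceModel(
    role="<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>",  # placeholder
    # Pin whichever TGI image version is available in your region.
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.0.2"),
    env={
        "HF_MODEL_ID": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "SM_NUM_GPUS": "1",  # ml.g5.xlarge has a single A10G
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")

# TGI's request/response shape:
print(predictor.predict({
    "inputs": "What is a small language model?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.4},
}))
# -> [{"generated_text": "..."}]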

Deploy (minimal)

The repo includes a Makefile that wraps the CDK deployment. The only commands you really need are:

# 1) Authenticate
make sso-login PROFILE=ml-sage

# 2) One-time bootstrap per account/region
make bootstrap PROFILE=ml-sage REGION=eu-west-2

# 3) Deploy
make deploy PROFILE=ml-sage REGION=eu-west-2

Deployment prints an API URL and an API key ID. Use the ID to fetch the key value:

aws apigateway get-api-key \
  --api-key <API_KEY_ID_FROM_OUTPUT> \
  --include-value \
  --profile ml-sage \
  --region eu-west-2

Call It

Example request:

curl -X POST "https://<api-id>.execute-api.<region>.amazonaws.com/prod/invoke" \
  -H "Content-Type: application/json" \
  -H "x-api-key: <YOUR_API_KEY>" \
  -d '{
    "prompt": "In one sentence, explain what an API Gateway + Lambda front end adds in front of a SageMaker real-time endpoint.",
    "parameters": {
      "max_new_tokens": 128,
      "temperature": 0.4
    }
  }'
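
The same call from Python, if you'd rather not shell out to curl; the URL and key come from the deploy outputs, and requests is the only dependency:

import requests

API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/invoke"
API_KEY = "<YOUR_API_KEY>"

resp = requests.post(
    API_URL,
    headers={"x-api-key": API_KEY},
    json={
        "prompt": "In one sentence, explain what an API Gateway + Lambda "
                  "front end adds in front of a SageMaker real-time endpoint.",
        "parameters": {"max_new_tokens": 128, "temperature": 0.4},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])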

Example response:

{
  "generated_text": "API Gateway + Lambda gives you a stable HTTPS interface with authentication, throttling, and request/response shaping before your traffic hits the SageMaker endpoint.",
  "prompt": "In one sentence, explain what an API Gateway + Lambda front end adds in front of a SageMaker real-time endpoint.",
  "parameters": {
    "max_new_tokens": 128,
    "temperature": 0.4,
    "do_sample": true
  }
}

Cleanup (seriously)

When you’re done, delete it so you’re not paying for an always-on GPU:

make destroy PROFILE=ml-sage REGION=eu-west-2

Summary

  • You get a stable HTTPS /invoke endpoint in front of a GPU-backed SageMaker real-time model (ml.g5.xlarge).
  • API Gateway handles keys + throttling; Lambda is the thin "HTTP-to-SageMaker" shim.
  • TinyLlama-1.1B-Chat-v1.0 is a compact, efficient model for Q&A/summarization and general instruction following.
  • Deploy is basically make sso-login, make bootstrap, make deploy (see above).
  • Don’t forget the cost warning: when you’re done, run make destroy PROFILE=ml-sage REGION=eu-west-2.