This post shows a pragmatic pattern for deploying TinyLlama/TinyLlama-1.1B-Chat-v1.0 behind a SageMaker real-time endpoint, then fronting it with API Gateway + Lambda and an API key so you can call it like a normal HTTPS API.
Code: `slm_sagemaker`
What you’ll end up with:
- A `/invoke` endpoint you can `curl`
- Usage-plan throttling + API key authentication
- A simple request/response shim in Lambda
- Clear diagrams so you can reason about the pieces
⚠️ COST WARNING: This deployment uses a real-time `ml.g5.xlarge` GPU instance that runs 24/7. The endpoint continues billing even when not processing requests, and pricing varies by region, so destroy the stack when you're not using it: `make destroy PROFILE=ml-sage REGION=eu-west-2`
Architecture
System Components
The following diagram illustrates the AWS components that make up this real-time SLM solution:
```mermaid
architecture-beta
    group api(cloud)[API Layer]
    group compute(cloud)[Compute Layer]
    group ml(cloud)[ML Infrastructure]

    service gateway(internet)[API Gateway] in api
    service apikey(disk)[API Key] in api
    service lambda(server)[Lambda Function] in compute
    service sagemaker(server)[SageMaker Endpoint] in ml
    service model(database)[TinyLlama Model] in ml

    gateway:R --> L:lambda
    apikey:B --> T:gateway
    lambda:R --> L:sagemaker
    sagemaker:B --> T:model
```
Component Overview:
- API Gateway: REST API with a `/invoke` endpoint for client requests
- API Key: Authenticates and rate-limits API requests (50 req/sec, 10k/day; see the CDK sketch after this list)
- Lambda Function: Processes requests and invokes the SageMaker endpoint
- SageMaker Endpoint: Real-time inference endpoint on `ml.g5.xlarge` (NVIDIA A10G, 24 GB GPU memory)
- TinyLlama-1.1B Model: Hugging Face TGI container serving the language model
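The throttle and quota numbers above live in an API Gateway usage plan. As a rough sketch of what that wiring looks like in CDK (Python here; construct IDs and the burst limit are illustrative, and the repo's actual code may differ):

```python
from aws_cdk import aws_apigateway as apigw

# Assumes `api` is the RestApi already defined elsewhere in the stack.
api_key = api.add_api_key("SlmApiKey")

plan = api.add_usage_plan(
    "SlmUsagePlan",
    throttle=apigw.ThrottleSettings(rate_limit=50, burst_limit=100),   # 50 req/sec steady state
    quota=apigw.QuotaSettings(limit=10_000, period=apigw.Period.DAY),  # 10k requests/day
)
plan.add_api_key(api_key)
plan.add_api_stage(stage=api.deployment_stage)
```

Requests without a valid `x-api-key`, or beyond the quota, are rejected by API Gateway before they ever reach Lambda.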
Request Flow
The sequence diagram below shows how a request flows through the system from an API client to the SLM model and back:
```mermaid
sequenceDiagram
    actor Client
    participant API as API Gateway
    participant Lambda as Lambda Function
    participant SageMaker as SageMaker Endpoint
    participant Model as TinyLlama Model
    Client->>+API: POST /invoke<br/>(x-api-key header)
    Note over API: Validate API Key<br/>Check rate limits
    API->>+Lambda: Invoke with payload
    Note over Lambda: Extract prompt<br/>& parameters
    Lambda->>+SageMaker: invoke_endpoint()
    Note over SageMaker: Scale up if cold<br/>(0-60 seconds)
    SageMaker->>+Model: Generate text
    Note over Model: Process prompt<br/>with TinyLlama-1.1B
    Model-->>-SageMaker: Generated response
    SageMaker-->>-Lambda: Response JSON
    Lambda-->>-API: Format response
    API-->>-Client: Return generated text
```
In plain English:
- API Gateway checks your API key (and throttles if needed)
- Lambda validates the request and calls the SageMaker runtime (a minimal handler is sketched below)
- SageMaker runs the model container on a GPU instance
- You get a JSON response back
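To make step 2 concrete, here's a minimal sketch of the Lambda shim. It assumes Python with boto3, an `ENDPOINT_NAME` environment variable set by the stack, and API Gateway's proxy integration; the repo's handler may differ in detail:

```python
import json
import os

import boto3

# SageMaker runtime client; ENDPOINT_NAME is assumed to be injected by the stack.
runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]


def handler(event, context):
    # With proxy integration, API Gateway delivers the HTTP body as a string.
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt")
    if not prompt:
        return {"statusCode": 400, "body": json.dumps({"error": "prompt is required"})}

    # TGI expects {"inputs": ..., "parameters": {...}}; caller params override defaults.
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 128, "do_sample": True, **body.get("parameters", {})},
    }

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())  # TGI returns [{"generated_text": ...}]

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "generated_text": result[0]["generated_text"],
            "prompt": prompt,
            "parameters": payload["parameters"],
        }),
    }
```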
The Model (TinyLlama-1.1B-Chat-v1.0)
`TinyLlama/TinyLlama-1.1B-Chat-v1.0` is a compact instruction-tuned model with only 1.1B parameters. The small size makes it ideal for cost-effective deployments: it fits comfortably on a single `ml.g5.xlarge` instance while still being capable enough for basic summarization, Q&A, and conversational tasks. The model is fast and efficient, making it suitable for prototyping and low-latency applications.
In this setup it’s served via the Hugging Face TGI (Text Generation Inference) container, which gives you a straightforward request/response interface and handles batching/tokenization inside the container.
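The repo wires the endpoint up through CDK, but to show what the TGI serving configuration amounts to, here's roughly the equivalent with the SageMaker Python SDK (role ARN is a placeholder; versions and env vars are illustrative):

```python
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Resolve the TGI (LLM) serving image for the current region.
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role="<SAGEMAKER_EXECUTION_ROLE_ARN>",  # assumed to exist already
    env={
        "HF_MODEL_ID": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "SM_NUM_GPUS": "1",  # one A10G on ml.g5.xlarge
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

# Same request shape the Lambda shim sends.
print(predictor.predict({"inputs": "Hello!", "parameters": {"max_new_tokens": 32}}))
```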
Deploy (minimal)
The repo includes a Makefile that wraps the CDK deployment. The only commands you really need are:
```bash
# 1) Authenticate
make sso-login PROFILE=ml-sage

# 2) One-time bootstrap per account/region
make bootstrap PROFILE=ml-sage REGION=eu-west-2

# 3) Deploy
make deploy PROFILE=ml-sage REGION=eu-west-2
```
Deployment prints an API URL and an API key id. Use the id to fetch the key value:
```bash
aws apigateway get-api-key \
  --api-key <API_KEY_ID_FROM_OUTPUT> \
  --include-value \
  --profile ml-sage \
  --region eu-west-2
```
Call It
Example request:
```bash
curl -X POST "https://<api-id>.execute-api.<region>.amazonaws.com/prod/invoke" \
  -H "Content-Type: application/json" \
  -H "x-api-key: <YOUR_API_KEY>" \
  -d '{
    "prompt": "In one sentence, explain what an API Gateway + Lambda front end adds in front of a SageMaker real-time endpoint.",
    "parameters": {
      "max_new_tokens": 128,
      "temperature": 0.4
    }
  }'
```
Example response:
```json
{
  "generated_text": "API Gateway + Lambda gives you a stable HTTPS interface with authentication, throttling, and request/response shaping before your traffic hits the SageMaker endpoint.",
  "prompt": "In one sentence, explain what an API Gateway + Lambda front end adds in front of a SageMaker real-time endpoint.",
  "parameters": {
    "max_new_tokens": 128,
    "temperature": 0.4,
    "do_sample": true
  }
}
```
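If you'd rather call it from Python than shell out to curl, a minimal client looks like this (using `requests`; the URL and key are the placeholders from the deploy output):

```python
import requests

API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/invoke"  # from deploy output
API_KEY = "<YOUR_API_KEY>"

resp = requests.post(
    API_URL,
    headers={"Content-Type": "application/json", "x-api-key": API_KEY},
    json={
        "prompt": "In one sentence, explain what an API Gateway + Lambda front end adds "
                  "in front of a SageMaker real-time endpoint.",
        "parameters": {"max_new_tokens": 128, "temperature": 0.4},
    },
    timeout=60,  # generous: generation plus any container warm-up
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```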
Cleanup (seriously)
When you’re done, delete it so you’re not paying for an always-on GPU:
```bash
make destroy PROFILE=ml-sage REGION=eu-west-2
```
Summary
- You get a stable HTTPS `/invoke` endpoint in front of a GPU-backed SageMaker real-time model (`ml.g5.xlarge`).
- API Gateway handles keys + throttling; Lambda is the thin "HTTP-to-SageMaker" shim.
- TinyLlama-1.1B-Chat-v1.0 is a compact, efficient model for Q&A/summarization and general instruction following.
- Deploy is basically `make sso-login`, `make bootstrap`, `make deploy` (see above).
- Don't forget the cost warning: when you're done, run `make destroy PROFILE=ml-sage REGION=eu-west-2`.