gRPC Error Handling — Status Codes, Rich Errors, Retries, and Interceptors
Why gRPC errors are different#
REST uses HTTP status codes — 200, 404, 500. Simple but limited. gRPC has its own status code system with richer semantics and a built-in mechanism for attaching structured error details.
If you treat gRPC errors like HTTP errors, you lose half the power.
gRPC status codes#
gRPC defines 17 status codes. Every RPC completes with exactly one of them.
| Code | Number | Meaning |
|---|---|---|
| OK | 0 | Success |
| CANCELLED | 1 | Client cancelled the request |
| UNKNOWN | 2 | Unknown error (for example, a handler returned an error with no status attached) |
| INVALID_ARGUMENT | 3 | Client sent bad input |
| DEADLINE_EXCEEDED | 4 | Timeout — operation took too long |
| NOT_FOUND | 5 | Resource does not exist |
| ALREADY_EXISTS | 6 | Resource already exists (conflict) |
| PERMISSION_DENIED | 7 | Caller lacks permission |
| RESOURCE_EXHAUSTED | 8 | Rate limit or quota exceeded |
| FAILED_PRECONDITION | 9 | Operation rejected due to system state |
| ABORTED | 10 | Operation aborted (concurrency conflict) |
| OUT_OF_RANGE | 11 | Operation outside valid range |
| UNIMPLEMENTED | 12 | Method not implemented |
| INTERNAL | 13 | Internal server error |
| UNAVAILABLE | 14 | Service temporarily unavailable |
| DATA_LOSS | 15 | Unrecoverable data loss |
| UNAUTHENTICATED | 16 | Missing or invalid authentication |
Choosing the right code#
INVALID_ARGUMENT vs FAILED_PRECONDITION: Use INVALID_ARGUMENT when the input is bad regardless of system state (malformed email). Use FAILED_PRECONDITION when the input is valid but the system is not in the right state (deleting a non-empty directory).
UNAVAILABLE vs INTERNAL: Use UNAVAILABLE for transient failures the client should retry (service restarting). Use INTERNAL for bugs the client cannot fix by retrying.
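NOT_FOUND vs PERMISSION_DENIED: If exposing the existence of a resource is a security concern, check permissions before checking existence and return PERMISSION_DENIED instead of NOT_FOUND, so that unauthorized callers get the same answer whether or not the resource exists.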
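The distinctions above can be sketched as a single mapping function from domain errors to status codes. This is a minimal illustration, not a prescribed layout: the sentinel errors (`ErrMalformedEmail`, `ErrDirNotEmpty`, and so on) are hypothetical, and `Code` is a local stand-in for `codes.Code` from `google.golang.org/grpc/codes`, used only so the sketch runs without external dependencies. Its numeric values follow the table above.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a local stand-in for codes.Code (assumption: real code would
// import google.golang.org/grpc/codes instead).
type Code uint32

const (
	InvalidArgument    Code = 3
	NotFound           Code = 5
	FailedPrecondition Code = 9
	Internal           Code = 13
	Unavailable        Code = 14
)

// Hypothetical domain errors, invented for this example.
var (
	ErrMalformedEmail    = errors.New("malformed email")
	ErrDirNotEmpty       = errors.New("directory not empty")
	ErrNoSuchUser        = errors.New("no such user")
	ErrBackendRestarting = errors.New("backend restarting")
)

// codeFor picks a status code for a domain error.
func codeFor(err error) Code {
	switch {
	case errors.Is(err, ErrMalformedEmail):
		return InvalidArgument // bad regardless of system state
	case errors.Is(err, ErrDirNotEmpty):
		return FailedPrecondition // valid input, wrong system state
	case errors.Is(err, ErrNoSuchUser):
		return NotFound
	case errors.Is(err, ErrBackendRestarting):
		return Unavailable // transient, worth retrying
	default:
		return Internal // anything unrecognized is treated as a bug
	}
}

func main() {
	wrapped := fmt.Errorf("rmdir: %w", ErrDirNotEmpty)
	fmt.Println(codeFor(ErrMalformedEmail), codeFor(wrapped), codeFor(errors.New("nil map write")))
	// prints: 3 9 13
}
```

Because the lookup uses `errors.Is`, wrapped errors still map to the intended code, and anything unrecognized falls through to INTERNAL rather than leaking a misleading code.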
Returning errors from a server#
Basic error response (Go)#
import (
	"context"
	"database/sql"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func (s *server) GetUser(ctx context.Context, req *pb.GetUserRequest) (*pb.User, error) {
	// Reject bad input before touching the database
	if req.UserId == "" {
		return nil, status.Error(codes.InvalidArgument, "user_id is required")
	}
	user, err := s.db.FindUser(ctx, req.UserId)
	if err != nil {
		if errors.Is(err, sql.ErrNoRows) {
			return nil, status.Errorf(codes.NotFound, "user %s not found", req.UserId)
		}
		// Keep the underlying database error out of the client-facing message
		return nil, status.Error(codes.Internal, "failed to fetch user")
	}
	return user, nil
}
Basic error response (Python)#
import grpc
import user_pb2_grpc  # generated gRPC stubs for the service

class UserService(user_pb2_grpc.UserServiceServicer):
    def GetUser(self, request, context):
        if not request.user_id:
            # abort() raises, so the handler stops here
            context.abort(grpc.StatusCode.INVALID_ARGUMENT, "user_id is required")
        user = self.db.find_user(request.user_id)
        if user is None:
            context.abort(grpc.StatusCode.NOT_FOUND, f"user {request.user_id} not found")
        return user
Never leak internal details. The status message goes to the client. "failed to fetch user" is fine. "connection refused to postgres://prod-db:5432" is not.
The rich error model#
A status code and message are often not enough. The client needs to know which field was invalid, how long to wait before retrying, or what went wrong in detail.
gRPC's rich error model lets you attach structured error details using protobuf messages from google.rpc.error_details.
Common error detail types#
- BadRequest — field-level validation errors
- RetryInfo — how long the client should wait before retrying
- DebugInfo — stack traces and debug data (do not expose to external clients)
- ErrorInfo — machine-readable error reason, domain, and metadata
- QuotaFailure — which quota was exceeded
- PreconditionFailure — which precondition was not met
Example: field validation errors (Go)#
import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func validateCreateUser(req *pb.CreateUserRequest) error {
	var violations []*errdetails.BadRequest_FieldViolation
	if req.Email == "" {
		violations = append(violations, &errdetails.BadRequest_FieldViolation{
			Field:       "email",
			Description: "email is required",
		})
	}
	if len(req.Password) < 8 {
		violations = append(violations, &errdetails.BadRequest_FieldViolation{
			Field:       "password",
			Description: "password must be at least 8 characters",
		})
	}
	if len(violations) > 0 {
		st := status.New(codes.InvalidArgument, "invalid request")
		detailed, err := st.WithDetails(&errdetails.BadRequest{
			FieldViolations: violations,
		})
		if err != nil {
			return st.Err()
		}
		return detailed.Err()
	}
	return nil
}
Example: retry info for rate limiting#
func (s *server) ProcessOrder(ctx context.Context, req *pb.OrderRequest) (*pb.OrderResponse, error) {
	if !s.rateLimiter.Allow() {
		st := status.New(codes.ResourceExhausted, "rate limit exceeded")
		// durationpb is google.golang.org/protobuf/types/known/durationpb
		detailed, err := st.WithDetails(&errdetails.RetryInfo{
			RetryDelay: durationpb.New(30 * time.Second),
		})
		if err != nil {
			// detailed is nil here, so fall back to the plain status
			return nil, st.Err()
		}
		return nil, detailed.Err()
	}
	// process order...
}
Retry policies#
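gRPC has built-in client-side retry support. You configure it through the service config instead of writing retry loops in application code. In grpc-go, for example, the JSON below can be supplied as a default with grpc.WithDefaultServiceConfig.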
{
  "methodConfig": [{
    "name": [{"service": "mypackage.MyService"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "10s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
    }
  }]
}
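To see what this policy means in practice, the sketch below computes the backoff ceiling for each retry, assuming the formula from gRPC's retry design: min(initialBackoff × backoffMultiplier^(n-1), maxBackoff) for the n-th retry. The library adds random jitter on top of these values, and the exact jitter scheme depends on the gRPC version, so treat the numbers as upper-level guidance rather than exact wait times. The `retryPolicy` struct and `backoffCaps` function are names invented for this example.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// retryPolicy mirrors the fields of the JSON retryPolicy above
// (hypothetical type, not part of any gRPC package).
type retryPolicy struct {
	MaxAttempts       int
	InitialBackoff    time.Duration
	MaxBackoff        time.Duration
	BackoffMultiplier float64
}

// backoffCaps returns the pre-jitter backoff for each retry.
// maxAttempts counts the original call, so there are maxAttempts-1 retries.
func backoffCaps(p retryPolicy) []time.Duration {
	if p.MaxAttempts < 2 {
		return nil
	}
	caps := make([]time.Duration, 0, p.MaxAttempts-1)
	for n := 0; n < p.MaxAttempts-1; n++ {
		d := float64(p.InitialBackoff) * math.Pow(p.BackoffMultiplier, float64(n))
		if d > float64(p.MaxBackoff) {
			d = float64(p.MaxBackoff)
		}
		caps = append(caps, time.Duration(d))
	}
	return caps
}

// examplePolicy holds the same values as the service config above.
func examplePolicy() retryPolicy {
	return retryPolicy{
		MaxAttempts:       4,
		InitialBackoff:    100 * time.Millisecond,
		MaxBackoff:        10 * time.Second,
		BackoffMultiplier: 2.0,
	}
}

func main() {
	fmt.Println(backoffCaps(examplePolicy()))
	// prints: [100ms 200ms 400ms]
}
```

With maxAttempts set to 4, the client makes the original call plus at most three retries, so the 10s maxBackoff never comes into play for this particular policy.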
Which codes to retry#
- UNAVAILABLE: retry with backoff. The server is temporarily down or unreachable.
- DEADLINE_EXCEEDED: retry with caution, and only for idempotent operations. The original attempt might have partially or fully completed.
- RESOURCE_EXHAUSTED: retry after the delay from RetryInfo.
- ABORTED: retry. This is a concurrency conflict that may resolve on a fresh attempt.
- INTERNAL: usually do not retry. This is a bug, not a transient failure.
- INVALID_ARGUMENT: never retry. The request itself is wrong.
Hedged requests#
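For latency-sensitive calls, gRPC supports hedging: the client sends additional copies of the same request without waiting for earlier ones to fail, spaced hedgingDelay apart, and uses the first successful response. Use it only for methods that are safe to execute more than once, and configure it with care, because it multiplies load. A method can have a retryPolicy or a hedgingPolicy, but not both.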
{
  "methodConfig": [{
    "name": [{"service": "mypackage.ReadService"}],
    "hedgingPolicy": {
      "maxAttempts": 3,
      "hedgingDelay": "0.5s",
      "nonFatalStatusCodes": ["UNAVAILABLE", "INTERNAL"]
    }
  }]
}
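The mechanics of hedging can be illustrated without gRPC at all. The sketch below is a simplified model, not the library's implementation: `hedge` and `fakeBackend` are hypothetical names, every error is treated as non-fatal, and it assumes maxAttempts is at least 1 and hedgingDelay is positive. It launches one attempt, adds another each time the delay elapses, returns the first success, and cancels the rest.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type result struct {
	val string
	err error
}

// hedge runs up to maxAttempts copies of call, spaced hedgingDelay apart,
// and returns the first successful result.
func hedge(ctx context.Context, maxAttempts int, hedgingDelay time.Duration,
	call func(ctx context.Context, attempt int) (string, error)) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the losing attempts once we return

	// Buffered so that late finishers never block on send.
	results := make(chan result, maxAttempts)
	launched, finished := 0, 0
	launch := func() {
		attempt := launched
		launched++
		go func() {
			v, err := call(ctx, attempt)
			results <- result{v, err}
		}()
	}

	launch()
	ticker := time.NewTicker(hedgingDelay)
	defer ticker.Stop()
	for {
		select {
		case r := <-results:
			finished++
			if r.err == nil {
				return r.val, nil
			}
			if finished == maxAttempts {
				return "", r.err // every attempt failed
			}
		case <-ticker.C:
			if launched < maxAttempts {
				launch()
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// fakeBackend simulates a slow first replica and fast later ones.
func fakeBackend(ctx context.Context, attempt int) (string, error) {
	latency := 10 * time.Millisecond
	if attempt == 0 {
		latency = 500 * time.Millisecond
	}
	select {
	case <-time.After(latency):
		return fmt.Sprintf("attempt-%d", attempt), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func demo() (string, error) {
	return hedge(context.Background(), 3, 20*time.Millisecond, fakeBackend)
}

func main() {
	fmt.Println(demo()) // with these latencies the second attempt wins
}
```

Note that the slow first attempt still consumed backend capacity until it was cancelled, which is the load multiplication the text warns about.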
Deadline propagation#
Every gRPC call should have a deadline. Deadlines prevent requests from hanging forever, and they propagate through the call chain as long as each service passes the incoming context (or its language's equivalent) to its outgoing calls.
Client (5s deadline) → Service A (4.8s remaining) → Service B (4.5s remaining) → Database
Setting deadlines (Go)#
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

resp, err := client.GetUser(ctx, &pb.GetUserRequest{UserId: "123"})
if err != nil {
	st := status.Convert(err)
	if st.Code() == codes.DeadlineExceeded {
		// handle timeout
	}
}
Propagation rules#
- The deadline propagates through context: every downstream call inherits the remaining time
- A downstream service cannot extend the deadline, only shorten it
- When the deadline expires, all in-flight RPCs in the chain are cancelled
- Always check ctx.Err() before starting expensive operations
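Two of these rules can be checked with the standard library alone, since Go's gRPC deadlines ride on `context`. The sketch below is illustrative: `childDeadline`, `expensiveStep`, and `demo` are hypothetical helpers. It shows that asking for a longer timeout on a derived context leaves the inherited deadline unchanged, while a shorter timeout moves it earlier, and that a cancelled context can be detected before work starts.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// childDeadline derives a context with its own timeout from parent and
// reports the deadline the child actually ends up with.
func childDeadline(parent context.Context, timeout time.Duration) time.Time {
	child, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	d, _ := child.Deadline()
	return d
}

// expensiveStep bails out early if the context is already done.
func expensiveStep(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	// ... the expensive work would go here ...
	return nil
}

// demo reports whether a longer child timeout is ignored and whether a
// shorter one takes effect.
func demo() (cannotExtend, canShorten bool) {
	parent, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	parentDeadline, _ := parent.Deadline()

	cannotExtend = childDeadline(parent, 30*time.Second).Equal(parentDeadline)
	canShorten = childDeadline(parent, 1*time.Second).Before(parentDeadline)
	return cannotExtend, canShorten
}

func main() {
	fmt.Println(demo()) // prints: true true

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(expensiveStep(ctx)) // prints: context canceled
}
```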
Common mistake: no deadline#
A call without a deadline waits forever. If the server is slow, the client's goroutines/threads pile up. Eventually the client runs out of resources. Always set a deadline.
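One way to guard against this mistake is a small helper that applies a fallback timeout only when the caller has not set a deadline. This is a sketch under assumptions: `ensureDeadline` is a hypothetical name, and in a real client it would most likely live in a unary client interceptor so that every call passes through it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ensureDeadline returns ctx unchanged if it already has a deadline,
// otherwise a derived context that expires after fallback.
func ensureDeadline(ctx context.Context, fallback time.Duration) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {} // the caller's deadline wins; nothing extra to release
	}
	return context.WithTimeout(ctx, fallback)
}

// demo reports whether a deadline is added when missing and left alone
// when one is already present.
func demo() (addedWhenMissing, keptWhenPresent bool) {
	bare, cancelBare := ensureDeadline(context.Background(), 2*time.Second)
	defer cancelBare()
	_, addedWhenMissing = bare.Deadline()

	parent, cancelParent := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelParent()
	want, _ := parent.Deadline()
	kept, cancelKept := ensureDeadline(parent, 2*time.Second)
	defer cancelKept()
	got, _ := kept.Deadline()
	keptWhenPresent = got.Equal(want)
	return addedWhenMissing, keptWhenPresent
}

func main() {
	fmt.Println(demo()) // prints: true true
}
```

Leaving an existing deadline untouched is a design choice: the caller usually knows its own latency budget better than a global default does.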
Error interceptors#
Interceptors (middleware) let you handle errors in one place instead of every RPC method.
Server-side error interceptor (Go)#
func errorInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		// Log the full error internally (log is assumed to be a leveled logger with Errorf)
		log.Errorf("RPC %s failed: %v", info.FullMethod, err)

		// Record metrics
		errorCounter.WithLabelValues(
			info.FullMethod,
			status.Code(err).String(),
		).Inc()

		// Sanitize: do not leak internal details to clients.
		// Errors without a status surface as Unknown, so cover both codes.
		if code := status.Code(err); code == codes.Internal || code == codes.Unknown {
			return nil, status.Error(code, "internal error")
		}
	}
	return resp, err
}

// Register the interceptor when constructing the server
server := grpc.NewServer(
	grpc.UnaryInterceptor(errorInterceptor),
)
What to do in interceptors#
- Log every error with full context (method, request ID, stack trace)
- Record metrics — error rate by method and status code
- Sanitize — strip internal details from INTERNAL and UNKNOWN errors
- Translate — convert domain errors to gRPC status codes
- Add metadata — inject request IDs or trace IDs into error details
Client-side handling#
Extracting error details (Go)#
resp, err := client.CreateUser(ctx, req)
if err != nil {
	st := status.Convert(err)
	// Check the status code
	switch st.Code() {
	case codes.InvalidArgument:
		// Extract field violations
		for _, detail := range st.Details() {
			if badReq, ok := detail.(*errdetails.BadRequest); ok {
				for _, v := range badReq.FieldViolations {
					fmt.Printf("Field %s: %s\n", v.Field, v.Description)
				}
			}
		}
	case codes.ResourceExhausted:
		// Extract the server-suggested retry delay
		for _, detail := range st.Details() {
			if retryInfo, ok := detail.(*errdetails.RetryInfo); ok {
				time.Sleep(retryInfo.RetryDelay.AsDuration())
				// retry the call
			}
		}
	case codes.Unavailable:
		// If a retry policy is configured, its attempts are already exhausted
		// by the time this error surfaces, so fail over or report the outage
	default:
		log.Errorf("unexpected error: %s: %s", st.Code(), st.Message())
	}
}
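The `time.Sleep` in the RESOURCE_EXHAUSTED branch above keeps waiting even if the call's context is cancelled or its deadline passes in the meantime. A context-aware wait avoids that. The sketch below uses only the standard library; `sleepCtx` and the two demo functions are hypothetical names introduced for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepCtx waits for d, or returns early with the context's error if ctx
// is cancelled or its deadline expires first.
func sleepCtx(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// waitInterrupted: a 5s retry delay under a 20ms deadline gives up early.
func waitInterrupted() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	err := sleepCtx(ctx, 5*time.Second)
	return errors.Is(err, context.DeadlineExceeded)
}

// waitCompleted: a short delay with no deadline runs to completion.
func waitCompleted() bool {
	return sleepCtx(context.Background(), time.Millisecond) == nil
}

func main() {
	fmt.Println(waitInterrupted(), waitCompleted()) // prints: true true
}
```

In the client code, the call to `time.Sleep(retryInfo.RetryDelay.AsDuration())` could be swapped for `sleepCtx(ctx, retryInfo.RetryDelay.AsDuration())`, skipping the retry when it returns an error.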
Client-side best practices#
- Always check the status code before the message — codes are stable, messages are not
- Extract error details for actionable information (field violations, retry delays)
- Do not parse the message string — it is for humans, not machines
- Handle UNAVAILABLE and DEADLINE_EXCEEDED with retries
- Log the full error including details for debugging
Key takeaways#
- Use the right status code — INVALID_ARGUMENT for bad input, UNAVAILABLE for transient failures, INTERNAL for bugs
- Attach rich error details — field violations, retry info, and error reasons give clients actionable information
- Configure retry policies in service config — retry UNAVAILABLE and DEADLINE_EXCEEDED, never retry INVALID_ARGUMENT
- Always set deadlines — a call without a deadline is a resource leak waiting to happen
- Use interceptors for centralized logging, metrics, and error sanitization
- Never leak internal details — sanitize INTERNAL errors before they reach the client
Article #436 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.