<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>KServe Blog</title>
        <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog</link>
        <description>KServe Blog</description>
        <lastBuildDate>Tue, 27 May 2025 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Announcing KServe v0.15 - Advancing Generative AI Model Serving]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release</guid>
            <pubDate>Tue, 27 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.15 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on May 27, 2025</em></p>
<p>We are thrilled to announce the release of <strong>KServe v0.15</strong>, marking a significant leap forward in serving both predictive and generative AI models. This release introduces enhanced support for generative AI workloads, including advanced features for serving large language models (LLMs), improved model and KV caching mechanisms, and integration with Envoy AI Gateway.</p>
<p><img decoding="async" loading="lazy" alt="!generative_inference" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/kserve_generative_inference-21648e7df404ea6f57b9d3c83e8e0ca4.png" width="911" height="581" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-embracing-generative-ai-workloads">🤖 Embracing Generative AI Workloads<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-embracing-generative-ai-workloads" class="hash-link" aria-label="Direct link to 🤖 Embracing Generative AI Workloads" title="Direct link to 🤖 Embracing Generative AI Workloads" translate="no">​</a></h2>
<p>KServe v0.15 brings first-class support for generative AI workloads, marking a key evolution beyond traditional predictive AI. Unlike predictive models that infer outcomes from existing data, generative models like large language models (LLMs) create new content from prompts. This fundamental difference introduces new serving challenges. KServe now provides the infrastructure and optimizations needed to serve these models efficiently at scale.</p>
<p>To support these workloads, we've introduced a dedicated <strong>Generative AI</strong> section in our documentation, detailing the new capabilities and configurations tailored for generative models.</p>
<p>KServe now offers a <strong>lightweight</strong> installation for hosting LLMs on Kubernetes. To get started, please follow the <a href="https://kserve.github.io/archive/0.15/admin/kubernetes_deployment" target="_blank" rel="noopener noreferrer" class="">generative inference installation guide</a>. KEDA is an optional component for scaling based on LLM-specific metrics, and Envoy AI Gateway is integrated for advanced traffic management capabilities, including token rate limiting, a unified API, and intelligent routing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-generative-ai-features-in-v015">🚀 Key Generative AI Features in v0.15<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-key-generative-ai-features-in-v015" class="hash-link" aria-label="Direct link to 🚀 Key Generative AI Features in v0.15" title="Direct link to 🚀 Key Generative AI Features in v0.15" translate="no">​</a></h2>
<ul>
<li class=""><strong>Envoy AI Gateway Integration</strong></li>
<li class=""><strong>Multi Node Inference</strong></li>
<li class=""><strong>LLM Autoscaler with KEDA</strong></li>
<li class=""><strong>Distributed KV Cache with LMCache</strong></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-envoy-ai-gateway-support">🌐 Envoy AI Gateway Support<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-envoy-ai-gateway-support" class="hash-link" aria-label="Direct link to 🌐 Envoy AI Gateway Support" title="Direct link to 🌐 Envoy AI Gateway Support" translate="no">​</a></h3>
<p>KServe v0.15 adds initial support for <a href="https://aigateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Envoy AI Gateway</strong></a>, a CNCF open source project built on top of <a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a> and designed specifically for managing generative AI traffic at scale.</p>
<p><a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a> is also now supported in KServe along with <a href="https://gateway-api.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Gateway API</a>. Unlike traditional gateway solutions, Envoy AI Gateway provides advanced capabilities tailored to AI serving, including:</p>
<ul>
<li class="">Dynamic model routing based on request content, model metadata, or user context.</li>
<li class="">Built-in support for multi-tenant inference, with fine-grained access controls and authentication.</li>
<li class="">Unified API for routing and managing LLM/AI traffic easily.</li>
<li class="">Integrated observability for model-level performance insights.</li>
<li class="">Extensibility for inference-specific policies like rate-limiting by token, and model lifecycle management.</li>
<li class="">Automatic failover mechanisms to ensure service reliability.</li>
</ul>
<p>This integration enables a unified, intelligent entrypoint for both predictive and generative workloads—scaling from traditional models to complex LLMs—all while abstracting infrastructure complexity from the user. Please refer to <a href="https://kserve.github.io/archive/0.15/admin/ai-gateway_integration" target="_blank" rel="noopener noreferrer" class="">Envoy AI Gateway integration doc</a> for more details.</p>
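<p>As an illustrative sketch of how such routing can be expressed, the example below shows an Envoy AI Gateway route that exposes a unified OpenAI-style API and routes by model name. The resource kind, field names, and the header used here follow the Envoy AI Gateway project's conventions as we understand them, and the route and backend names are hypothetical; consult the Envoy AI Gateway documentation for the authoritative API.</p>
<pre><code class="language-yaml">apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-route                      # hypothetical name
spec:
  schema:
    name: OpenAI                       # expose a unified OpenAI-compatible API
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model      # model-aware routing on the requested model
              value: llama3
      backendRefs:
        - name: kserve-llama3-backend  # hypothetical backend pointing at the InferenceService
</code></pre>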
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-multi-node-inference">🔗 Multi-Node Inference<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-multi-node-inference" class="hash-link" aria-label="Direct link to 🔗 Multi-Node Inference" title="Direct link to 🔗 Multi-Node Inference" translate="no">​</a></h3>
<p>To support LLMs too large for a single node (e.g., Llama 3.1 405B), KServe v0.15 introduces multi-node inference across distributed GPUs, unlocking large model serving at scale. As models continue to increase in size, multi-node inference capabilities are increasingly important for production deployments that require real-time user experience. Please refer to the <a href="https://kserve.github.io/archive/0.15/modelserving/v1beta1/llm/huggingface/multi-node" target="_blank" rel="noopener noreferrer" class="">Multi Node inference doc</a> for more details.</p>
<p>The community is also working on a <a href="https://github.com/kserve/kserve/issues/4433" target="_blank" rel="noopener noreferrer" class="">new distributed inference API</a> to allow scaling multi-node inference and to support Disaggregated Prefilling, which is targeted at large LLM deployments.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> pvc</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pvc/hf/8b_instruction_tuned</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">workerSpec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pipelineParallelSize</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" 
style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">tensorParallelSize</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling">⚡ LLM Autoscaler with KEDA <a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class="">(Kubernetes Event-driven Autoscaling)</a><a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" class="hash-link" aria-label="Direct link to -llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" title="Direct link to -llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" translate="no">​</a></h3>
<p>Autoscaling LLMs is challenging due to their high resource demands and variable inference traffic patterns. The dynamic nature of LLM inference, with varying input lengths and token generation speeds, further complicates the prediction of resource needs, demanding sophisticated and adaptive autoscaling solutions. KServe now integrates with <a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class=""><strong>KEDA</strong></a> (Kubernetes Event-Driven Autoscaling), which extends Kubernetes' native Horizontal Pod Autoscaler (HPA) and offers a powerful solution to many of the challenges of LLM autoscaling. KEDA can monitor custom metrics, which means you can expose LLM metrics from your inference servers and use KEDA to scale based on these precise indicators.</p>
<p>This empowers users to efficiently manage LLM workloads with more intelligent scaling decisions based on workload characteristics for improved performance and cost optimization. Please follow the <a href="https://kserve.github.io/archive/0.15/modelserving/autoscaling/keda/autoscaling_llm" target="_blank" rel="noopener noreferrer" class="">tutorial doc</a> for how to autoscale based on vLLM metrics.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">keda</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">serving.kserve.io/autoscalerClass</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keda"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">sidecar.opentelemetry.io/inject</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"huggingface-llama3-keda"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70b</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">maxReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" 
style="color:#36acaa">5</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">autoScaling</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">metrics</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PodMetric</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">podmetric</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">metric</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">backend</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"opentelemetry"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" 
style="color:#00a4db">metricNames</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">num_requests_running</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">query</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"vllm:num_requests_running"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">target</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Value</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-distributed-kv-cache-with-lmcache">🚀 Distributed KV Cache with <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a><a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-distributed-kv-cache-with-lmcache" class="hash-link" aria-label="Direct link to -distributed-kv-cache-with-lmcache" title="Direct link to -distributed-kv-cache-with-lmcache" translate="no">​</a></h3>
<p>Key-Value (KV) cache offloading is a technique used in large language model (LLM) serving to store and reuse the intermediate key and value tensors generated during model inference. In transformer-based models, these KV caches represent the context for each token processed, and reusing them allows the model to avoid redundant computations for repeated or similar prompts.</p>
<p>Enabling KV cache offloading across multiple requests and serving instances can reduce Time To First Token (TTFT), improve scalability by sharing the cache across replicas, and improve the user experience for multi-turn QA or RAG.</p>
<p>KServe integrates <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a>, a state-of-the-art KV cache layer library developed by LMCache Lab, to reduce inference costs and meet SLOs for both latency and throughput at scale. Please follow the <a href="https://kserve.github.io/archive/0.15/modelserving/v1beta1/llm/huggingface/kv_cache_offloading/#overview" target="_blank" rel="noopener noreferrer" class="">LMCache integration doc</a> to optimize your GenAI inference workloads.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">lmcache</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span 
class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70b</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">kv</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">transfer</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">config</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">enable</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">chunked</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">prefill</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-advanced-model-caching-mechanisms">📦 Advanced Model Caching Mechanisms<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-advanced-model-caching-mechanisms" class="hash-link" aria-label="Direct link to 📦 Advanced Model Caching Mechanisms" title="Direct link to 📦 Advanced Model Caching Mechanisms" translate="no">​</a></h3>
<p>To reduce model loading times and improve overall efficiency of serving large models, KServe v0.15 introduces advanced model caching features:</p>
<ul>
<li class=""><strong>LocalModelCache Enhancements:</strong> Improved the LocalModelCache custom resource to support multiple node groups, providing greater flexibility in model placement and caching strategies.</li>
<li class=""><strong>Node Agent Improvements:</strong> Enhanced the local model node agent for better performance and reliability.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-vllm-backend-support">🔧 Enhanced vLLM Backend Support<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-enhanced-vllm-backend-support" class="hash-link" aria-label="Direct link to 🔧 Enhanced vLLM Backend Support" title="Direct link to 🔧 Enhanced vLLM Backend Support" translate="no">​</a></h3>
<p>The vLLM backend has been significantly upgraded to better serve generative AI models:</p>
<ul>
<li class=""><strong>Version Upgrade:</strong> Updated to vLLM 0.8.5, bringing performance improvements with the vLLM v1 engine and new features.</li>
<li class=""><strong>Qwen3 &amp; Llama4:</strong> Added support for Qwen3 and Llama4 models.</li>
<li class=""><strong>Reranking Support:</strong> Added support for reranking models.</li>
<li class=""><strong>Embedding Support:</strong> Added support for OpenAI-compatible embeddings API, enabling a broader range of applications.</li>
</ul>
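<p>As a hedged illustration of the new OpenAI-compatible embeddings support, the sketch below builds an OpenAI-style embeddings payload with only the standard library; the <code>/openai/v1/embeddings</code> route prefix and the model name are assumptions, so check your runtime's documentation and your InferenceService URL.</p>

```python
# Sketch: calling the OpenAI-compatible embeddings API of a KServe
# Hugging Face runtime. The route prefix and model name are assumptions;
# verify them against your InferenceService URL and runtime docs.
import json
import urllib.request


def build_embeddings_request(model, texts):
    """Build an OpenAI-style embeddings payload."""
    return {"model": model, "input": texts}


def post_embeddings(base_url, payload):
    """POST the payload to the (assumed) /openai/v1/embeddings route."""
    req = urllib.request.Request(
        f"{base_url}/openai/v1/embeddings",  # assumed route prefix
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # "bge-base" is a hypothetical deployed model name.
    payload = build_embeddings_request("bge-base", ["KServe serves models."])
    print(json.dumps(payload))
```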
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-additional-improvements">🛠️ Additional Improvements<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#%EF%B8%8F-additional-improvements" class="hash-link" aria-label="Direct link to 🛠️ Additional Improvements" title="Direct link to 🛠️ Additional Improvements" translate="no">​</a></h2>
<p>This release also includes several other enhancements:</p>
<ul>
<li class="">Support Deep Health Checks <a href="https://github.com/kserve/kserve/pull/3348" target="_blank" rel="noopener noreferrer" class="">#3348</a></li>
<li class="">Collocated Transformer &amp; Predictor Feature <a href="https://github.com/kserve/kserve/pull/4255" target="_blank" rel="noopener noreferrer" class="">#4255</a></li>
<li class="">Kubernetes Gateway API support <a href="https://github.com/kserve/kserve/pull/3952" target="_blank" rel="noopener noreferrer" class="">#3952</a></li>
<li class="">Security Updates</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.15.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We extend our gratitude to all the contributors who made this release possible. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers, along with regular and new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Special Recognition</strong>: The generative AI community for their valuable input on LLM serving requirements</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<p>We invite you to explore the new features in KServe v0.15 and contribute to the ongoing development of the project:</p>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[Announcing KServe v0.14]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release</guid>
            <pubDate>Fri, 13 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.14 Release Blog Post]]></description>
<content:encoded><![CDATA[<p><em>Published on December 13, 2024</em></p>
<p>We are excited to announce KServe v0.14. In this release we introduce a new Python client designed for KServe and a new model cache feature, promote OCI storage for models to a stable feature, and add support for deploying models directly from Hugging Face.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-features">🚀 Key Features<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-key-features" class="hash-link" aria-label="Direct link to 🚀 Key Features" title="Direct link to 🚀 Key Features" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-inference-client-for-python">Introducing Inference client for Python<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#introducing-inference-client-for-python" class="hash-link" aria-label="Direct link to Introducing Inference client for Python" title="Direct link to Introducing Inference client for Python" translate="no">​</a></h3>
<p>The KServe Python SDK now includes both <a href="https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L388" target="_blank" rel="noopener noreferrer" class="">REST</a> and <a href="https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L61" target="_blank" rel="noopener noreferrer" class="">GRPC</a> inference clients. The new inference clients of the SDK are delivered as <strong>alpha</strong> features.</p>
<p>In line with the features documented in issue <a href="https://github.com/kserve/kserve/issues/3270" target="_blank" rel="noopener noreferrer" class="">#3270</a>, both clients have the following characteristics:</p>
<ul>
<li class="">The clients are asynchronous</li>
<li class="">Support for HTTP/2 (via <a href="https://www.python-httpx.org/" target="_blank" rel="noopener noreferrer" class="">httpx</a> library)</li>
<li class="">Support for the Open Inference Protocol v1 and v2</li>
<li class="">Allow clients to send and receive tensor data in binary format for HTTP/REST requests; see the <a href="https://kserve.github.io/archive/0.14/modelserving/data_plane/binary_tensor_data_extension/" target="_blank" rel="noopener noreferrer" class="">binary tensor data extension docs</a>.</li>
</ul>
<p>As usual, version 0.14.0 of the KServe Python SDK is <a href="https://pypi.org/project/kserve/0.14.0/" target="_blank" rel="noopener noreferrer" class="">published to PyPI</a> and can be installed via <code>pip install kserve</code>.</p>
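<p>For illustration, the sketch below assembles the Open Inference Protocol v2 request body that these clients exchange; the tensor name and the endpoint path in the comments are illustrative, and the mention of the SDK's client call is a hedged reference to the alpha API, not a verified signature.</p>

```python
# Sketch: an Open Inference Protocol v2 request body, the format the new
# async REST/GRPC clients speak. The tensor name "input-0" is illustrative.
import json


def v2_infer_request(name, shape, datatype, data):
    """Assemble a v2 /infer request body for a single input tensor."""
    return {"inputs": [{"name": name, "shape": shape,
                        "datatype": datatype, "data": data}]}


body = v2_infer_request("input-0", [1, 3], "FP32", [[0.1, 0.2, 0.3]])
print(json.dumps(body))
# POST this to http(s)://HOST/v2/models/MODEL/infer yourself, or let the
# SDK's async inference client build and send it for you (alpha API).
```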
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-oci-storage-for-models-modelcars-becomes-stable">Support for OCI storage for models (modelcars) becomes stable<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#support-for-oci-storage-for-models-modelcars-becomes-stable" class="hash-link" aria-label="Direct link to Support for OCI storage for models (modelcars) becomes stable" title="Direct link to Support for OCI storage for models (modelcars) becomes stable" translate="no">​</a></h3>
<p>In KServe version 0.12, support for using OCI containers for model storage was introduced as an experimental feature. This allows users to store models in OCI-format containers and to publish them through OCI-compatible registries.</p>
<p>This feature was implemented by configuring the OCI model container as a sidecar in the InferenceService pod, which motivated naming the feature modelcars. The model files are made available to the model server by configuring <a href="https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/" target="_blank" rel="noopener noreferrer" class="">process namespace sharing</a> in the pod.</p>
<p>There was one small but important detail that remained unsolved and motivated the experimental status: since the modelcar runs as one of the main containers of the pod, there was no guarantee that it would start quickly. The model server would be unstable if it started before the modelcar, and since the model image was not prefetched, this was considered a likely condition.</p>
<p>The unstable situation has been mitigated by configuring the OCI model as an init container in addition to configuring it as a sidecar. The init container ensures that the model image is fetched before the main containers start, and this prefetching allows the modelcar sidecar to start quickly.
This stabilization is available since KServe version 0.14, where modelcars are now a stable feature.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-plan">Future plan<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#future-plan" class="hash-link" aria-label="Direct link to Future plan" title="Direct link to Future plan" translate="no">​</a></h4>
<p>Modelcars is one implementation option for supporting OCI images for model storage. There are other alternatives commented in <a href="https://github.com/kserve/kserve/issues/4083" target="_blank" rel="noopener noreferrer" class="">issue #4083</a>.</p>
<p>Using volume mounts based on OCI artifacts is the optimal implementation, but this only <a href="https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/" target="_blank" rel="noopener noreferrer" class="">recently became possible with Kubernetes 1.31</a>, which introduced it as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.</p>
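<p>As a sketch of where this could lead, the partial pod spec below uses the Kubernetes 1.31 alpha image volume source to mount an OCI model artifact directly; this is the upstream Kubernetes API (behind the <code>ImageVolume</code> feature gate), not something KServe generates today, and the image references are illustrative.</p>

```yaml
# Sketch: Kubernetes 1.31 alpha image volume source (ImageVolume feature
# gate required); both image references are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-example
spec:
  containers:
  - name: model-server
    image: example.com/model-server:latest   # hypothetical server image
    volumeMounts:
    - name: model
      mountPath: /mnt/models
      readOnly: true
  volumes:
  - name: model
    image:                                    # alpha volume source in K8s 1.31
      reference: example.com/models/llama3:8b # hypothetical OCI model artifact
      pullPolicy: IfNotPresent
```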
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-model-cache">Introducing Model Cache<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#introducing-model-cache" class="hash-link" aria-label="Direct link to Introducing Model Cache" title="Direct link to Introducing Model Cache" translate="no">​</a></h3>
<p>With models increasing in size, which is especially true for LLMs, pulling from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also has the benefit of model caching, its capabilities are not flexible, since cache management is delegated to the cluster.</p>
<p>The Model Cache was proposed as another alternative to enhance KServe usability with large models, and it is released in KServe v0.14 as an <strong>alpha</strong> feature.
In this release, local node storage is used for storing models, and the <code>LocalModelCache</code> custom resource controls which models are stored in the cache.
The local model cache state can always be rebuilt from the models stored on persistent storage, such as a model registry or S3.
Read the <a href="https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit" target="_blank" rel="noopener noreferrer" class="">design document for the details</a>.</p>
<p><img decoding="async" loading="lazy" alt="!localmodelcache" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/localmodelcache-59f819fe261fb8fcd66a6c875a73b3d6.png" width="2462" height="1416" class="img_ev3q"></p>
<p>By caching the models, you get the following benefits:</p>
<ul>
<li class="">
<p>Minimized time for LLM pods to start serving requests.</p>
</li>
<li class="">
<p>Shared storage for pods scheduled on the same GPU node.</p>
</li>
<li class="">
<p>Efficient scaling of AI workloads without slow model server container startup.</p>
</li>
</ul>
<p>The model cache is disabled by default. To enable it, set the <code>localmodel.enabled</code> field in the <code>inferenceservice-config</code> ConfigMap.</p>
<p>You can follow the <a href="https://kserve.github.io/archive/0.14/modelserving/storage/modelcache/localmodel/" target="_blank" rel="noopener noreferrer" class="">local model cache tutorial</a> to cache LLMs on the local NVMe drives of your GPU nodes and deploy LLMs with an <code>InferenceService</code> that loads models from the local cache to accelerate container startup.</p>
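<p>A <code>LocalModelCache</code> resource along the lines of the tutorial might look like the sketch below; the field names follow our reading of the v0.14 alpha API, while the model URI, size, and node group name are illustrative, so consult the tutorial for the authoritative schema.</p>

```yaml
# Sketch of a LocalModelCache resource (alpha API); values are illustrative.
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct  # model to cache
  modelSize: 10Gi        # space reserved on the node's local disk
  nodeGroups:
  - workers              # GPU node group whose local NVMe stores the model
```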
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-hugging-face-hub-in-storage-initializer">Support for Hugging Face hub in storage initializer<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#support-for-hugging-face-hub-in-storage-initializer" class="hash-link" aria-label="Direct link to Support for Hugging Face hub in storage initializer" title="Direct link to Support for Hugging Face hub in storage initializer" translate="no">​</a></h3>
<p>The KServe storage initializer has been enhanced to support downloading models directly from Hugging Face. For this, the new <code>hf://</code> schema is now supported in the <code>storageUri</code> field of InferenceServices. The following partial YAML shows this:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span></code></pre></div></div>
<p>Both public and private Hugging Face repositories are supported. The credentials can be provided by the usual mechanism of binding Secrets to ServiceAccounts, or by binding the credentials Secret as environment variables in the InferenceService.</p>
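<p>For a private repository, binding the token Secret as an environment variable can be sketched as follows; the Secret and model names are illustrative, and <code>HF_TOKEN</code> is the variable the Hugging Face libraries read when downloading:</p>

```yaml
# Sketch: private Hugging Face model with credentials from a Secret.
# Secret name and model repository are hypothetical.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-private
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: hf://my-org/my-private-model   # hypothetical private repo
      env:
      - name: HF_TOKEN                # read by huggingface_hub on download
        valueFrom:
          secretKeyRef:
            name: hf-secret           # hypothetical Secret holding the token
            key: HF_TOKEN
```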
<p>Read the <a href="https://kserve.github.io/archive/0.14/modelserving/storage/huggingface/hf/" target="_blank" rel="noopener noreferrer" class="">documentation</a> for more details.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-enhancements-and-improvements">🛠️ Enhancements and Improvements<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#%EF%B8%8F-enhancements-and-improvements" class="hash-link" aria-label="Direct link to 🛠️ Enhancements and Improvements" title="Direct link to 🛠️ Enhancements and Improvements" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hugging-face-vllm-backend-changes">Hugging Face vLLM backend changes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#hugging-face-vllm-backend-changes" class="hash-link" aria-label="Direct link to Hugging Face vLLM backend changes" title="Direct link to Hugging Face vLLM backend changes" translate="no">​</a></h3>
<ul>
<li class="">Updated the vLLM backend to 0.6.1 <a href="https://github.com/kserve/kserve/pull/3948" target="_blank" rel="noopener noreferrer" class="">#3948</a></li>
<li class="">Support the <code>trust_remote_code</code> flag for vLLM <a href="https://github.com/kserve/kserve/pull/3729" target="_blank" rel="noopener noreferrer" class="">#3729</a></li>
<li class="">Support the text embedding task in the Hugging Face server <a href="https://github.com/kserve/kserve/pull/3743" target="_blank" rel="noopener noreferrer" class="">#3743</a></li>
<li class="">Add health endpoint for vLLM backend <a href="https://github.com/kserve/kserve/pull/3850" target="_blank" rel="noopener noreferrer" class="">#3850</a></li>
<li class="">Added <code>hostIPC</code> field to <code>ServingRuntime</code> CRD, for supporting more than one GPU in Serverless mode <a href="https://github.com/kserve/kserve/issues/3791" target="_blank" rel="noopener noreferrer" class="">#3791</a></li>
<li class="">Support shared memory volume for vLLM backend <a href="https://github.com/kserve/kserve/pull/3910" target="_blank" rel="noopener noreferrer" class="">#3910</a></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="other-enhancements">Other Enhancements<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#other-enhancements" class="hash-link" aria-label="Direct link to Other Enhancements" title="Direct link to Other Enhancements" translate="no">​</a></h3>
<ul>
<li class="">New flag for automounting the service account token <a href="https://github.com/kserve/kserve/pull/3979" target="_blank" rel="noopener noreferrer" class="">#3979</a></li>
<li class="">TLS support for inference loggers <a href="https://github.com/kserve/kserve/issues/3837" target="_blank" rel="noopener noreferrer" class="">#3837</a></li>
<li class="">Allow PVC storage to be mounted in ReadWrite mode via an annotation <a href="https://github.com/kserve/kserve/issues/3687" target="_blank" rel="noopener noreferrer" class="">#3687</a></li>
<li class="">Support HTTP Headers passing for KServe python custom runtimes <a href="https://github.com/kserve/kserve/pull/3669" target="_blank" rel="noopener noreferrer" class="">#3669</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class="">Ray is now an optional dependency <a href="https://github.com/kserve/kserve/pull/3834" target="_blank" rel="noopener noreferrer" class="">#3834</a></li>
<li class="">Support for Python 3.12 is added, while support for Python 3.8 is removed <a href="https://github.com/kserve/kserve/pull/3645" target="_blank" rel="noopener noreferrer" class="">#3645</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.14.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers, along with regular and new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the community" title="Direct link to 🤝 Join the community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[From Serverless Predictive Inference to Generative Inference - Introducing KServe v0.13]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release</guid>
            <pubDate>Wed, 15 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.13 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on May 15, 2024</em></p>
<p>We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards.</p>
<p><img decoding="async" loading="lazy" alt="!kserve-components" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-layer-08feccc0300cf8608f0a36b6572e70fb.png" width="960" height="540" class="img_ev3q"></p>
<p>Below is a summary of the key changes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-hugging-face-runtime-support">🚀 Enhanced Hugging Face Runtime Support<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-enhanced-hugging-face-runtime-support" class="hash-link" aria-label="Direct link to 🚀 Enhanced Hugging Face Runtime Support" title="Direct link to 🚀 Enhanced Hugging Face Runtime Support" translate="no">​</a></h2>
<p>KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a <a href="https://github.com/kserve/kserve/tree/release-0.13/python/huggingfaceserver" target="_blank" rel="noopener noreferrer" class="">KServe Hugging Face Serving Runtime</a>, <code>kserve-huggingfaceserver</code>. With this implementation, KServe can now automatically infer a <a href="https://huggingface.co/tasks" target="_blank" rel="noopener noreferrer" class="">task</a> from model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text to text generation.</p>
<p><img decoding="async" loading="lazy" alt="!kserve-huggingface" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-huggingface-209566d5f98a98d521606e57b4531a19.png" width="7243" height="2208" class="img_ev3q"></p>
<p>Here is an example of serving a BERT model by deploying an InferenceService with the Hugging Face runtime for a classification task.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">bert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=bert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=bert</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">base</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">uncased</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">tensor_input_names=input_ids</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" 
style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><br></span></code></pre></div></div>
<p>You can also deploy BERT on a more optimized inference runtime such as Triton, using the Hugging Face Runtime for pre/post processing; see more details <a href="https://kserve.github.io/archive/0.13/modelserving/v1beta1/triton/huggingface/" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-vllm-support">🔧 vLLM Support<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-vllm-support" class="hash-link" aria-label="Direct link to 🔧 vLLM Support" title="Direct link to 🔧 vLLM Support" translate="no">​</a></h3>
<p>Version 0.13 introduces dedicated runtime support for <a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer" class="">vLLM</a> for enhanced transformer model serving. This support includes auto-mapping vLLM as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, serving defaults to the Hugging Face backend. See the example below.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"6"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 24Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"6"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 24Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><br></span></code></pre></div></div>
<p>See more details in our updated docs to <a href="https://kserve.github.io/archive/0.13/modelserving/v1beta1/llm/huggingface/" target="_blank" rel="noopener noreferrer" class="">Deploy the Llama3 model with Hugging Face LLM Serving Runtime</a>.</p>
<p>Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the <code>--backend=huggingface</code> arg.</p>
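<p>As an illustration, the backend override is just another entry in the predictor <code>args</code>. A minimal sketch, mirroring the Llama3 example above (the service name below is a hypothetical placeholder):</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3-hf-backend
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=llama3
      - --model_id=meta-llama/meta-llama-3-8b-instruct
      # disables vLLM auto-mapping in favor of the Hugging Face backend
      - --backend=huggingface
</code></pre></div></div>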
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-openai-schema-integration">🌐 OpenAI Schema Integration<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-openai-schema-integration" class="hash-link" aria-label="Direct link to 🌐 OpenAI Schema Integration" title="Direct link to 🌐 OpenAI Schema Integration" translate="no">​</a></h3>
<p>Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models:</p>
<ul>
<li class=""><code>/openai/v1/completions</code></li>
<li class=""><code>/openai/v1/chat/completions</code></li>
<li class=""><code>/openai/v1/models</code></li>
</ul>
<p>These endpoints are useful for generative transformer models, which take in messages and return a model-generated message. The <a href="https://platform.openai.com/docs/guides/text-generation/chat-completions-api" target="_blank" rel="noopener noreferrer" class="">chat completions endpoint</a> is designed to handle multi-turn conversations easily, while still being useful for single-turn tasks. The <a href="https://platform.openai.com/docs/guides/text-generation/completions-api" target="_blank" rel="noopener noreferrer" class="">completions endpoint</a> is now a legacy endpoint; it differs from the chat completions endpoint in that its interface is a freeform text string called a <code>prompt</code>. Read more about the <a href="https://platform.openai.com/docs/api-reference/chat" target="_blank" rel="noopener noreferrer" class="">chat completions</a> and <a href="https://platform.openai.com/docs/api-reference/completions" target="_blank" rel="noopener noreferrer" class="">completions</a> endpoints in the OpenAI API docs.</p>
<p>This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools and enhancing the platform's versatility. The API can be used directly with OpenAI's client libraries or third-party tools such as LangChain or LlamaIndex.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-future-plan">🔮 Future Plan<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-future-plan" class="hash-link" aria-label="Direct link to 🔮 Future Plan" title="Direct link to 🔮 Future Plan" translate="no">​</a></h3>
<ul>
<li class="">Support other tasks like text embeddings <a href="https://github.com/kserve/kserve/issues/3572" target="_blank" rel="noopener noreferrer" class="">#3572</a>.</li>
<li class="">Support more LLM backend options in the future, such as TensorRT-LLM.</li>
<li class="">Enrich text generation metrics for Throughput(tokens/sec), TTFT(Time to first token) <a href="https://github.com/kserve/kserve/issues/3461" target="_blank" rel="noopener noreferrer" class="">#3461</a>.</li>
<li class="">KEDA integration for token based LLM Autoscaling <a href="https://github.com/kserve/kserve/issues/3561" target="_blank" rel="noopener noreferrer" class="">#3561</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-other-changes">🛠️ Other Changes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#%EF%B8%8F-other-changes" class="hash-link" aria-label="Direct link to 🛠️ Other Changes" title="Direct link to 🛠️ Other Changes" translate="no">​</a></h2>
<p>This release also includes several enhancements and changes:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">✨ What's New?<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-whats-new" class="hash-link" aria-label="Direct link to ✨ What's New?" title="Direct link to ✨ What's New?" translate="no">​</a></h3>
<ul>
<li class="">Async streaming support for v1 endpoints <a href="https://github.com/kserve/kserve/issues/3402" target="_blank" rel="noopener noreferrer" class="">#3402</a>.</li>
<li class="">Support for <code>.json</code> and <code>.ubj</code> model formats in the XGBoost server image <a href="https://github.com/kserve/kserve/issues/3546" target="_blank" rel="noopener noreferrer" class="">#3546</a>.</li>
<li class="">Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service <a href="https://github.com/kserve/kserve/issues/2747" target="_blank" rel="noopener noreferrer" class="">#2747</a>.</li>
<li class="">Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments <a href="https://github.com/kserve/kserve/issues/3470" target="_blank" rel="noopener noreferrer" class="">#3470</a>.</li>
</ul>
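<p>As a quick sketch of the XGBoost change above, a model saved with the native <code>.json</code> format can now be served directly (the storage path below is a hypothetical placeholder):</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: xgboost-json
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      # assumption: the directory contains a model saved as model.json (or model.ubj)
      storageUri: gs://my-bucket/models/xgboost
</code></pre></div></div>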
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h3>
<ul>
<li class="">Removed Seldon Alibi dependency <a href="https://github.com/kserve/kserve/issues/3380" target="_blank" rel="noopener noreferrer" class="">#3380</a>.</li>
<li class="">Removal of conversion webhook from manifests. <a href="https://github.com/kserve/kserve/issues/3344" target="_blank" rel="noopener noreferrer" class="">#3344</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.13.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Special Recognition</strong>: Contributors who helped drive the generative AI capabilities forward</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[Announcing KServe v0.11]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release</guid>
            <pubDate>Sun, 08 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.11 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on October 8, 2023</em></p>
<p>We are excited to announce the release of KServe 0.11. This release introduces Large Language Model (LLM) runtimes and brings enhancements to the KServe control plane, Open Inference Protocol support in the Python SDK, and dependency management. For ModelMesh, we have added PVC, HPA, and payload logging support to ensure feature parity with KServe.</p>
<p>Here is a summary of the key changes:</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-core-inference-enhancements">🚀 KServe Core Inference Enhancements<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-kserve-core-inference-enhancements" class="hash-link" aria-label="Direct link to 🚀 KServe Core Inference Enhancements" title="Direct link to 🚀 KServe Core Inference Enhancements" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>Path-based routing support</strong>, which serves as an alternative to host-based routing; the URL of the <code>InferenceService</code> looks like <code>http://&lt;ingress_domain&gt;/serving/&lt;namespace&gt;/&lt;isvc_name&gt;</code>.
Please refer to the <a href="https://github.com/kserve/kserve/blob/294a10495b6b5cda9c64d3e1573b60aec62aceb9/config/configmap/inferenceservice.yaml#L237" target="_blank" rel="noopener noreferrer" class="">doc</a> for how to enable path-based routing.</p>
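<p>As a sketch, path-based routing is driven by the <code>ingress</code> section of the KServe <code>inferenceservice-config</code> ConfigMap; the gateway and domain values below are illustrative placeholders:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">ingress: |-
  {
    "ingressGateway": "knative-serving/knative-ingress-gateway",
    "ingressDomain": "example.com",
    "pathTemplate": "/serving/{{ .Namespace }}/{{ .Name }}"
  }
</code></pre></div></div>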
</li>
<li class="">
<p><strong>Priority field for Serving Runtime</strong> custom resource to handle the case where multiple serving runtimes support the same model format; see more details in <a href="https://kserve.github.io/archive/0.11/modelserving/servingruntimes/#priority" target="_blank" rel="noopener noreferrer" class="">the serving runtime doc</a>.</p>
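<p>For example, the priority is declared per supported model format on the runtime, and the runtime with the higher value is selected when several runtimes match. A minimal sketch (the runtime name and image here are illustrative):</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-sklearnserver
spec:
  supportedModelFormats:
  - name: sklearn
    version: "1"
    autoSelect: true
    # chosen over other matching runtimes with a lower priority value
    priority: 2
  containers:
  - name: kserve-container
    image: kserve/sklearnserver:latest
</code></pre></div></div>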
</li>
<li class="">
<p><strong>Custom Storage Container CRD</strong> to allow customized storage initializer implementations for supported storage URI prefixes; an example use case is private model registry integration:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1alpha1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ClusterStorageContainer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">container</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> storage</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">initializer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve/model</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">registry</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100Mi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   
     </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 1Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">supportedUriFormats</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">prefix</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> model</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">registry</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//</span><br></span></code></pre></div></div>
</li>
<li class="">
<p><strong>Inference Graph enhancements</strong> improving the API spec to support pod affinity and resource requirement fields.
A <code>Dependency</code> field with options <code>Soft</code> and <code>Hard</code> is introduced to handle error responses from the inference steps and decide whether to short-circuit the request in case of errors; see the following example with a hard dependency on the node steps:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceGraph</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> graph_with_switch_node</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">nodes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      
</span><span class="token key atrule" style="color:#00a4db">root</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Sequence</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"rootStep1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">nodeName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> node1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">dependency</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hard</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"rootStep2"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> success_200_isvc_id </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">node1</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Switch</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"node1Step1"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> error_404_isvc_id </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">condition</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[@this].#(decision_picker==ERROR)"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">dependency</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hard</span><br></span></code></pre></div></div>
<p>For more details, please refer to the <a href="https://github.com/kserve/kserve/issues/2484" target="_blank" rel="noopener noreferrer" class="">issue</a>.</p>
</li>
<li class="">
<p><strong>Improved InferenceService debugging experience</strong> by adding the aggregated <code>RoutesReady</code> status and the <code>LastDeploymentReady</code> condition to the InferenceService status, making it possible to distinguish the endpoint status from the deployment status.
This applies to serverless mode; for more details, refer to the <a href="https://pkg.go.dev/github.com/kserve/kserve@v0.11.1/pkg/apis/serving/v1beta1#InferenceServiceStatus" target="_blank" rel="noopener noreferrer" class="">API docs</a>.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-python-sdk-dependency-management">📦 Enhanced Python SDK Dependency Management<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-enhanced-python-sdk-dependency-management" class="hash-link" aria-label="Direct link to 📦 Enhanced Python SDK Dependency Management" title="Direct link to 📦 Enhanced Python SDK Dependency Management" translate="no">​</a></h3>
<ul>
<li class="">
<p>KServe has adopted <a href="https://python-poetry.org/docs/" target="_blank" rel="noopener noreferrer" class="">poetry</a> to manage Python dependencies. You can now install the KServe SDK with locked dependencies using <code>poetry install</code>.
While <code>pip install</code> still works, we highly recommend using poetry to ensure predictable dependency management.</p>
</li>
<li class="">
<p>The KServe SDK has also been slimmed down by making the cloud storage dependency optional. If you need the storage dependencies for custom serving runtimes, you can still install them with <code>pip install kserve[storage]</code>.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-python-runtimes-improvements">🔧 KServe Python Runtimes Improvements<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-kserve-python-runtimes-improvements" class="hash-link" aria-label="Direct link to 🔧 KServe Python Runtimes Improvements" title="Direct link to 🔧 KServe Python Runtimes Improvements" translate="no">​</a></h3>
<ul>
<li class="">
<p>KServe Python Runtimes including <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/sklearn/v2/" target="_blank" rel="noopener noreferrer" class="">sklearnserver</a>, <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/lightgbm/" target="_blank" rel="noopener noreferrer" class="">lgbserver</a>, <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/xgboost/" target="_blank" rel="noopener noreferrer" class="">xgbserver</a>
now support the open inference protocol for both REST and gRPC.</p>
</li>
<li class="">
<p>Logging improvements including adding Uvicorn access logging and a default KServe logger.</p>
</li>
<li class="">
<p>The <code>postprocess</code> handler has been aligned with the open inference protocol, hiding the complexities of the underlying transport protocol.</p>
</li>
</ul>
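<p>As a concrete sketch, an open inference protocol (v2) REST request to one of these runtimes is a JSON body containing a list of named, typed input tensors. The model name <code>sklearn-iris</code>, the endpoint host, and the input shape below are illustrative, not taken from the release notes:</p>

```python
import json

# Open inference protocol (v2) REST request body: each input tensor
# carries a name, a shape, a datatype, and the tensor data.
infer_request = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [2, 4],
            "datatype": "FP32",
            "data": [
                [6.8, 2.8, 4.8, 1.4],
                [6.0, 3.4, 4.5, 1.6],
            ],
        }
    ]
}

# The body would be POSTed to the runtime's v2 infer endpoint, e.g.:
url = "http://localhost:8080/v2/models/sklearn-iris/infer"
body = json.dumps(infer_request)
```

<p>The gRPC flavor of the protocol carries the same tensor structure as protobuf messages instead of JSON.</p>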
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-llm-runtimes">🤖 LLM Runtimes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-llm-runtimes" class="hash-link" aria-label="Direct link to 🤖 LLM Runtimes" title="Direct link to 🤖 LLM Runtimes" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="torchserve-llm-runtime">TorchServe LLM Runtime<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#torchserve-llm-runtime" class="hash-link" aria-label="Direct link to TorchServe LLM Runtime" title="Direct link to TorchServe LLM Runtime" translate="no">​</a></h4>
<p>KServe now integrates with TorchServe 0.8, offering support for <a href="https://pytorch.org/serve/large_model_inference.html" target="_blank" rel="noopener noreferrer" class="">LLM models</a> that may not fit on a single GPU.
Huggingface Accelerate and DeepSpeed are available options for partitioning the model across multiple GPUs. See the <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/llm/torchserve/accelerate/" target="_blank" rel="noopener noreferrer" class="">detailed example</a> of how to serve an LLM on KServe with the TorchServe runtime.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-runtime">vLLM Runtime<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#vllm-runtime" class="hash-link" aria-label="Direct link to vLLM Runtime" title="Direct link to vLLM Runtime" translate="no">​</a></h4>
<p>Serving LLM models can be surprisingly slow even on high-end GPUs. <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a> is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Huggingface transformers.
It supports <a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" target="_blank" rel="noopener noreferrer" class="">continuous batching</a> for increased throughput and GPU utilization, and
<a href="https://vllm.ai/" target="_blank" rel="noopener noreferrer" class="">paged attention</a> to address the memory bottleneck of autoregressive decoding, where all the attention key-value tensors (the KV cache) are kept in GPU memory to generate the next tokens.</p>
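<p>To see why the KV cache becomes the memory bottleneck, a back-of-the-envelope calculation helps. The model dimensions below are illustrative (roughly a 7B-parameter transformer) and not tied to any specific runtime:</p>

```python
# Rough KV-cache size per sequence: each layer stores a key and a value
# tensor of shape (seq_len, num_heads * head_dim), here in fp16 (2 bytes).
num_layers = 32        # illustrative transformer depth
num_heads = 32         # attention heads per layer
head_dim = 128         # dimension per head
seq_len = 2048         # tokens kept in the cache
bytes_per_elem = 2     # fp16

# factor of 2: one key tensor plus one value tensor per layer
kv_bytes = 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_elem
print(f"KV cache per sequence: {kv_bytes / 1024**3:.2f} GiB")
# → KV cache per sequence: 1.00 GiB
```

<p>At that rate, a batch of a few dozen concurrent sequences consumes tens of GiB of GPU memory for the cache alone. Paged attention mitigates this by allocating the cache in small, non-contiguous blocks, so memory is not reserved up front for the full maximum sequence length.</p>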
<p>The <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/llm/vllm/" target="_blank" rel="noopener noreferrer" class="">example</a> shows how to deploy vLLM on KServe; we expect further integration in KServe 0.12 with the proposed <a href="https://github.com/kserve/open-inference-protocol/pull/7" target="_blank" rel="noopener noreferrer" class="">generate endpoint</a> for the open inference protocol.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">📊 ModelMesh Updates<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 📊 ModelMesh Updates" title="Direct link to 📊 ModelMesh Updates" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-storing-models-on-kubernetes-persistent-volumes-pvc">💾 Storing Models on Kubernetes Persistent Volumes (PVC)<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-storing-models-on-kubernetes-persistent-volumes-pvc" class="hash-link" aria-label="Direct link to 💾 Storing Models on Kubernetes Persistent Volumes (PVC)" title="Direct link to 💾 Storing Models on Kubernetes Persistent Volumes (PVC)" translate="no">​</a></h3>
<p>ModelMesh now allows you to <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/setup-storage.md#deploy-a-model-stored-on-a-persistent-volume-claim" target="_blank" rel="noopener noreferrer" class="">directly mount model files onto serving runtime pods</a>
using <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Persistent Volumes</a>. Depending on the selected <a href="https://kubernetes.io/docs/concepts/storage/storage-classes/" target="_blank" rel="noopener noreferrer" class="">storage solution</a>, this approach can significantly reduce latency when deploying new predictors and
potentially remove the need for cloud object storage such as AWS S3, GCS, or Azure Blob Storage altogether.</p>
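<p>A minimal sketch of an InferenceService backed by a PVC, assuming ModelMesh mode and a hypothetical claim named <code>my-models-pvc</code> (field names follow the linked setup-storage guide; verify them against your ModelMesh version):</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-pvc-example
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        parameters:
          type: pvc
          name: my-models-pvc   # hypothetical PersistentVolumeClaim
        path: sklearn/mnist-svm.joblib
```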
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-horizontal-pod-autoscaling-hpa">⚡ Horizontal Pod Autoscaling (HPA)<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-horizontal-pod-autoscaling-hpa" class="hash-link" aria-label="Direct link to ⚡ Horizontal Pod Autoscaling (HPA)" title="Direct link to ⚡ Horizontal Pod Autoscaling (HPA)" translate="no">​</a></h3>
<p>Kubernetes <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" target="_blank" rel="noopener noreferrer" class="">Horizontal Pod Autoscaling</a> can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a <code>HorizontalPodAutoscaler</code> automatically updates the serving
runtime deployment with the number of Pods to best match the demand.</p>
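<p>As an illustration, a standard Kubernetes <code>HorizontalPodAutoscaler</code> targeting a serving runtime deployment could look like the following; the deployment name and thresholds are hypothetical:</p>

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: modelmesh-runtime-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-mlserver-1.x   # hypothetical runtime deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```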
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-model-metrics-metrics-dashboard-payload-event-logging">📈 Model Metrics, Metrics Dashboard, Payload Event Logging<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-model-metrics-metrics-dashboard-payload-event-logging" class="hash-link" aria-label="Direct link to 📈 Model Metrics, Metrics Dashboard, Payload Event Logging" title="Direct link to 📈 Model Metrics, Metrics Dashboard, Payload Event Logging" translate="no">​</a></h3>
<p>ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, for example by allocating more resources or increasing the number of replicas to improve responsiveness or avoid frequent cache misses.</p>
<p>A new <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md#import-the-grafana-dashboard" target="_blank" rel="noopener noreferrer" class="">Grafana dashboard</a> was added to display the comprehensive set of <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md" target="_blank" rel="noopener noreferrer" class="">Prometheus metrics</a> like model loading
and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.</p>
<p>The new <a href="https://github.com/kserve/modelmesh/blob/main/src/main/java/com/ibm/watson/modelmesh/payload/" target="_blank" rel="noopener noreferrer" class=""><code>PayloadProcessor</code> interface</a> can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h2>
<ul>
<li class="">
<p>To allow longer InferenceService names despite DNS max-length limits (see the <a href="https://github.com/kserve/kserve/issues/1397" target="_blank" rel="noopener noreferrer" class="">issue</a>), the <code>Default</code> suffix has been removed from the inference service component (predictor/transformer/explainer) names for newly created InferenceServices.
This affects clients that use the component URL directly instead of the top-level InferenceService URL.</p>
</li>
<li class="">
<p>Status.address.url is now consistent between serverless and raw deployment modes: the URL path portion has been dropped in serverless mode.</p>
</li>
<li class="">
<p>Raw bytes are now accepted in the v1 protocol. For a JSON payload to be recognized and decoded, the <code>content-type</code> header must be set to <code>application/json</code>, for example:</p>
</li>
</ul>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the GitHub release pages for <a href="https://github.com/kserve/kserve/releases/tag/v0.11.0" target="_blank" rel="noopener noreferrer" class="">KServe v0.11</a> and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.11.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh v0.11</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Working Group</strong>: All members of the KServe Working Group for their ongoing collaboration</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[Announcing KServe v0.10.0]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release</guid>
            <pubDate>Sun, 05 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.10 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on February 5, 2023</em></p>
<p>We are excited to announce the KServe 0.10 release. In this release, we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes, and increased support coverage for the <a href="https://kserve.github.io/archive/0.10/modelserving/data_plane/v2_protocol/" target="_blank" rel="noopener noreferrer" class="">Open (aka v2) inference protocol</a> for both standard and ModelMesh InferenceServices.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-networking-options">🌐 KServe Networking Options<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-networking-options" class="hash-link" aria-label="Direct link to 🌐 KServe Networking Options" title="Direct link to 🌐 KServe Networking Options" translate="no">​</a></h2>
<p>Istio is now optional for both <a href="https://kserve.github.io/archive/0.10/admin/serverless/serverless/" target="_blank" rel="noopener noreferrer" class="">Serverless</a> and <a href="https://kserve.github.io/archive/0.10/admin/kubernetes_deployment/" target="_blank" rel="noopener noreferrer" class="">RawDeployment</a> mode. Please see the <a href="https://kserve.github.io/archive/0.10/admin/serverless/kourier_networking/" target="_blank" rel="noopener noreferrer" class="">alternative networking guide</a> for how you can enable other ingress options supported by Knative with Serverless mode.
For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the <a href="https://kserve.github.io/archive/0.10/admin/serverless/servicemesh/" target="_blank" rel="noopener noreferrer" class="">service mesh setup guideline</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-telemetry-for-serving-runtimes">📊 KServe Telemetry for Serving Runtimes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-telemetry-for-serving-runtimes" class="hash-link" aria-label="Direct link to 📊 KServe Telemetry for Serving Runtimes" title="Direct link to 📊 KServe Telemetry for Serving Runtimes" translate="no">​</a></h2>
<p>We have instrumented additional latency metrics in KServe Python ServingRuntimes for <code>preprocess</code>, <code>predict</code> and <code>postprocess</code> handlers.
In Serverless mode we have extended Knative <code>queue-proxy</code> to enable metrics aggregation for both metrics exposed in <code>queue-proxy</code> and <code>kserve-container</code> from each <code>ServingRuntime</code>.
Please read the <a href="https://kserve.github.io/archive/0.10/modelserving/observability/prometheus_metrics/" target="_blank" rel="noopener noreferrer" class="">prometheus metrics setup guideline</a> for how to enable the metrics scraping and aggregations.</p>
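<p>In practice, both behaviors are enabled with annotations on the InferenceService. The annotation names below follow the linked metrics guide (verify them against your KServe version), and the model URI is the standard KServe sklearn example:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  annotations:
    serving.kserve.io/enable-metric-aggregation: "true"
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```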
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-openv2-inference-protocol-support-coverage">🚀 Open(v2) Inference Protocol Support Coverage<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-openv2-inference-protocol-support-coverage" class="hash-link" aria-label="Direct link to 🚀 Open(v2) Inference Protocol Support Coverage" title="Direct link to 🚀 Open(v2) Inference Protocol Support Coverage" translate="no">​</a></h2>
<p>Adoption of the <code>KServe v2 Inference Protocol</code> has been increasing: the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/amd/" target="_blank" rel="noopener noreferrer" class="">AMD Inference ServingRuntime</a>
supports FPGAs, and OpenVINO now provides a KServe-compatible <a href="https://docs.openvino.ai/latest/ovms_docs_rest_api_kfs.html" target="_blank" rel="noopener noreferrer" class="">REST</a> and <a href="https://docs.openvino.ai/latest/ovms_docs_grpc_api_kfs.html" target="_blank" rel="noopener noreferrer" class="">gRPC</a> API.
In <a href="https://github.com/kserve/kserve/issues/2663" target="_blank" rel="noopener noreferrer" class="">the issue</a> we have proposed renaming it to the <code>KServe Open Inference Protocol</code>.</p>
<p>In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes.
Now, you can enable v2 REST/gRPC for both custom transformer and predictor with images built by implementing KServe Python SDK API.
gRPC enables a high-performance inference data plane: it is built on top of HTTP/2 and binary data transport, which is more efficient over the wire than REST.
Please see the detailed example for <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/transformer/torchserve_image_transformer/" target="_blank" rel="noopener noreferrer" class="">transformer</a> and
<a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/custom/custom_model/" target="_blank" rel="noopener noreferrer" class="">predictor</a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> kserve </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">byte_array</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image_processing </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Compose</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">ToTensor</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Normalize</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.1307</span><span class="token punctuation" style="color:#393A34">,</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.3081</span><span class="token punctuation" style="color:#393A34">,</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Image</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">io</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">BytesIO</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">byte_array</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    tensor </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> image_processing</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">image</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">numpy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> tensor</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomModel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Model</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">predict</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> InferResponse</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">instance</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> instance </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inputs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">asarray</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        output </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">model</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        torch</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">nn</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functional</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">softmax</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> dim</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        values</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> top_5 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> torch</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">topk</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> values</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">flatten</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">tolist</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        response_id </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> generate_uuid</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_output </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> InferOutput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"output-0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shape</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">values</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">shape</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> datatype</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"FP32"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">result</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> InferResponse</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">model_name</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> infer_outputs</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token plain">infer_output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> response_id</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">response_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> infer_response</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomTransformer</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Model</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">preprocess</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">instance</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> instance </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inputs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">asarray</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_inputs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">InferInput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"INPUT__0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> datatype</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'FP32'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shape</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">shape</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                                   data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">input_tensors</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_request </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">model_name</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">model_name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> infer_inputs</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">infer_inputs</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> infer_request</span><br></span></code></pre></div></div>
<p>You can use the same Python API types <code>InferRequest</code> and <code>InferResponse</code> for both the REST and gRPC protocols; KServe handles the underlying decoding and encoding according to the protocol in use.</p>
<p>⚠️ <strong>Warning</strong>: A new <code>headers</code> argument has been added to the custom handlers to pass HTTP/gRPC headers or other metadata. You can also use it as a context dict to pass data between handlers.
If you have an existing custom transformer or predictor, you must now add the <code>headers</code> argument to the <code>preprocess</code>, <code>predict</code>, and <code>postprocess</code> handlers.</p>
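<p>A minimal sketch of the updated handler signatures, shown standalone without the <code>kserve.Model</code> base class and with hypothetical payloads and header keys:</p>

```python
from typing import Any, Dict


class CustomModel:
    """Illustrates the new `headers` argument on each handler (sketch only)."""

    def preprocess(self, payload: Any, headers: Dict[str, str]) -> Any:
        # headers carries HTTP/gRPC metadata; it can also act as a context
        # dict for passing data between handlers.
        headers["x-preprocessed"] = "true"
        return payload

    def predict(self, payload: Any, headers: Dict[str, str]) -> Any:
        return {"predictions": payload}

    def postprocess(self, result: Any, headers: Dict[str, str]) -> Any:
        # The context value set in preprocess is visible here.
        result["preprocessed"] = headers.get("x-preprocessed")
        return result


model = CustomModel()
headers: Dict[str, str] = {}
out = model.postprocess(model.predict(model.preprocess([1, 2], headers), headers), headers)
```

In a real transformer or predictor, the class would subclass <code>kserve.Model</code> and the payloads would be <code>InferRequest</code>/<code>InferResponse</code> objects as shown above.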
<p>Please check the following matrix for supported ModelFormats and <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/serving_runtime/" target="_blank" rel="noopener noreferrer" class="">ServingRuntimes</a>.</p>
<table><thead><tr><th>Model Format</th><th>v1</th><th>Open(v2) REST/gRPC</th></tr></thead><tbody><tr><td>Tensorflow</td><td>✅ TFServing</td><td>✅ Triton</td></tr><tr><td>PyTorch</td><td>✅ TorchServe</td><td>✅ TorchServe</td></tr><tr><td>TorchScript</td><td>✅ TorchServe</td><td>✅ Triton</td></tr><tr><td>ONNX</td><td>❌</td><td>✅ Triton</td></tr><tr><td>Scikit-learn</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>XGBoost</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>LightGBM</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>MLFlow</td><td>❌</td><td>✅ MLServer</td></tr><tr><td>Custom</td><td>✅ KServe</td><td>✅ KServe</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-multi-arch-image-support">🏗️ Multi-Arch Image Support<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#%EF%B8%8F-multi-arch-image-support" class="hash-link" aria-label="Direct link to 🏗️ Multi-Arch Image Support" title="Direct link to 🏗️ Multi-Arch Image Support" translate="no">​</a></h2>
<p>The KServe control plane images <a href="https://hub.docker.com/r/kserve/kserve-controller/tags" target="_blank" rel="noopener noreferrer" class="">kserve-controller</a>,
<a href="https://hub.docker.com/r/kserve/agent/tags" target="_blank" rel="noopener noreferrer" class="">kserve/agent</a>, and <a href="https://hub.docker.com/r/kserve/router/tags" target="_blank" rel="noopener noreferrer" class="">kserve/router</a> are now built
for multiple architectures: <code>ppc64le</code>, <code>arm64</code>, <code>amd64</code>, and <code>s390x</code>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-storage-credentials-support">🔐 KServe Storage Credentials Support<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-storage-credentials-support" class="hash-link" aria-label="Direct link to 🔐 KServe Storage Credentials Support" title="Direct link to 🔐 KServe Storage Credentials Support" translate="no">​</a></h2>
<ul>
<li class="">Currently, AWS users need to create a secret with long-term static IAM credentials to download models stored in S3.
The security best practice is to use an <a href="https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/" target="_blank" rel="noopener noreferrer" class="">IAM role for service accounts (IRSA)</a>,
which enables automatic credential rotation and fine-grained access control; see how to <a href="https://kserve.github.io/archive/0.10/modelserving/storage/s3/s3/#create-service-account-with-iam-role" target="_blank" rel="noopener noreferrer" class="">set up IRSA</a>.</li>
<li class="">Azure Blob Storage is now supported with <a href="https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azcli" target="_blank" rel="noopener noreferrer" class="">managed identity</a>.</li>
</ul>
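<p>As an illustration, an IRSA-based setup might look like the following sketch; the role ARN, bucket path, and resource names are placeholders:</p>

```yaml
# Service account annotated with an IAM role (placeholder ARN).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-with-irsa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-model-reader
---
# InferenceService downloading its model from S3 via the annotated service account.
apiVersion: "serving.kserve.io/v1beta1"
kind: InferenceService
metadata:
  name: sklearn-s3
spec:
  predictor:
    serviceAccountName: sa-with-irsa
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-models/sklearn/iris
```

With this configuration the storage initializer obtains temporary credentials through the pod's service account instead of a static secret.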
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">📊 ModelMesh Updates<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 📊 ModelMesh Updates" title="Direct link to 📊 ModelMesh Updates" translate="no">​</a></h2>
<p>ModelMesh has continued its integration as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports <code>ClusterServingRuntimes</code>, allowing the use of cluster-scoped ServingRuntimes, a capability originally introduced in KServe 0.8.</p>
<p>Additionally, ModelMesh introduced support for TorchServe, enabling users to serve arbitrary PyTorch models (e.g., eager-mode models) in the context of distributed multi-model serving.</p>
<p>Other limitations have been addressed as well, such as support for BYTES/string tensors in the REST inference API for requests that require them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">GitHub release pages</a> for KServe v0.10 and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh v0.10</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/sel" target="_blank" rel="noopener noreferrer" class="">Steve Larkin</a></li>
<li class=""><a href="https://github.com/stephanschielke" target="_blank" rel="noopener noreferrer" class="">Stephan Schielke</a></li>
<li class=""><a href="https://github.com/cmaddalozzo" target="_blank" rel="noopener noreferrer" class="">Curtis Maddalozzo</a></li>
<li class=""><a href="https://github.com/laozc" target="_blank" rel="noopener noreferrer" class="">Zhongcheng Lao</a></li>
<li class=""><a href="https://github.com/dimara" target="_blank" rel="noopener noreferrer" class="">Dimitris Aragiorgis</a></li>
<li class=""><a href="https://github.com/panli889" target="_blank" rel="noopener noreferrer" class="">Pan Li</a></li>
<li class=""><a href="https://github.com/tjandy98" target="_blank" rel="noopener noreferrer" class="">tjandy98</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/rachitchauhan43" target="_blank" rel="noopener noreferrer" class="">Rachit Chauhan</a></li>
<li class=""><a href="https://github.com/rafvasq" target="_blank" rel="noopener noreferrer" class="">Rafael Vasquez</a></li>
<li class=""><a href="https://github.com/TimKleinloog" target="_blank" rel="noopener noreferrer" class="">Tim Kleinloog</a></li>
<li class=""><a href="https://github.com/ckadner" target="_blank" rel="noopener noreferrer" class="">Christian Kadner</a></li>
<li class=""><a href="https://github.com/ddelange" target="_blank" rel="noopener noreferrer" class="">ddelange</a></li>
<li class=""><a href="https://github.com/lizzzcai" target="_blank" rel="noopener noreferrer" class="">Lize Cai</a></li>
<li class=""><a href="https://github.com/park12sj" target="_blank" rel="noopener noreferrer" class="">sangjune.park</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkeran</a></li>
<li class=""><a href="https://github.com/MessKon" target="_blank" rel="noopener noreferrer" class="">Konstantinos Messis</a></li>
<li class=""><a href="https://github.com/matty-rose" target="_blank" rel="noopener noreferrer" class="">Matt Rose</a></li>
<li class=""><a href="https://github.com/alexagriffith" target="_blank" rel="noopener noreferrer" class="">Alexa Griffith</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh J</a></li>
<li class=""><a href="https://github.com/alembiewski" target="_blank" rel="noopener noreferrer" class="">Alex Lembiyeuski</a></li>
<li class=""><a href="https://github.com/tenzen-y" target="_blank" rel="noopener noreferrer" class="">Yuki Iwai</a></li>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/xfu83" target="_blank" rel="noopener noreferrer" class="">Xin Fu</a></li>
<li class=""><a href="https://github.com/adilhusain-s" target="_blank" rel="noopener noreferrer" class="">adilhusain-s</a></li>
<li class=""><a href="https://github.com/pranavpandit1" target="_blank" rel="noopener noreferrer" class="">Pranav Pandit</a></li>
<li class=""><a href="https://github.com/C1berwiz" target="_blank" rel="noopener noreferrer" class="">C1berwiz</a></li>
<li class=""><a href="https://github.com/dilverse" target="_blank" rel="noopener noreferrer" class="">dilverse</a></li>
<li class=""><a href="https://github.com/terrytangyuan" target="_blank" rel="noopener noreferrer" class="">Yuan Tang</a></li>
<li class=""><a href="https://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and working group members</p>
<p><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to contribute. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[Announcing KServe v0.9.0]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release</guid>
            <pubDate>Thu, 21 Jul 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.9 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on July 21, 2022</em></p>
<p>Today, we are pleased to announce the v0.9.0 release of KServe! <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">KServe</a> has now fully onboarded to <a href="https://lfaidata.foundation/" target="_blank" rel="noopener noreferrer" class="">LF AI &amp; Data Foundation</a> as an <a href="https://lfaidata.foundation/projects/kserve" target="_blank" rel="noopener noreferrer" class="">Incubation Project</a>! 🎉</p>
<p>In this release we are excited to introduce the new <code>InferenceGraph</code> feature, which the community has long requested. Continuing the effort from the last release to unify the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with the KServe InferenceService API!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-introducing-inferencegraph">🚀 Introducing InferenceGraph<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-introducing-inferencegraph" class="hash-link" aria-label="Direct link to 🚀 Introducing InferenceGraph" title="Direct link to 🚀 Introducing InferenceGraph" translate="no">​</a></h2>
<p>ML inference systems are becoming larger and more complex, often requiring many models to make a single prediction.
Common use cases include image classification and multi-stage natural language processing pipelines. For example, an image classification pipeline may need to run a top-level classifier first and then a more specialized downstream classifier based on the previous prediction result.</p>
<p>KServe is uniquely positioned to build distributed inference graphs thanks to its native integration with InferenceServices, its standard inference protocol for chaining models, and its serverless auto-scaling capabilities. KServe leverages these strengths in the InferenceGraph, enabling users to deploy complex ML inference pipelines to production in a declarative and scalable way.</p>
<p>An <strong>InferenceGraph</strong> is made up of a list of routing nodes, each consisting of a set of routing steps. A step can route either to an InferenceService or to another node defined in the graph, which makes the InferenceGraph highly composable.
The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four types of routing nodes: <strong>Sequence</strong>, <strong>Switch</strong>, <strong>Ensemble</strong>, and <strong>Splitter</strong>.</p>
<p><img decoding="async" loading="lazy" alt="InferenceGraph" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/inference_graph-c394dbbe6fb6a1ff7f03706f82566247.png" width="1962" height="834" class="img_ev3q"></p>
<ul>
<li class=""><strong>Sequence Node</strong>: Allows users to define multiple <code>Steps</code> with <code>InferenceServices</code> or <code>Nodes</code> as routing targets, executed in sequence. The request/response from the previous step can be passed to the next step as input, depending on the configuration.</li>
<li class=""><strong>Switch Node</strong>: Allows users to define routing conditions and selects a <code>Step</code> to execute when its condition matches. The response is returned from the first step whose condition matches; if no condition matches, the graph returns the original request.</li>
<li class=""><strong>Ensemble Node</strong>: A model ensemble scores each model separately and then combines the results into a single prediction response. Different combination methods can be used to produce the final result: multiple classification trees, for example, are commonly combined using a "majority vote" method, while multiple regression trees are often combined with various averaging techniques.</li>
<li class=""><strong>Splitter Node</strong>: Allows users to split traffic across multiple targets using a weighted distribution.</li>
</ul>
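<p>The combination methods mentioned for Ensemble nodes can be sketched in plain Python; these are hypothetical helpers for illustration, not part of KServe:</p>

```python
from collections import Counter
from typing import List


def majority_vote(labels: List[str]) -> str:
    """Combine classifier outputs by picking the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def average_vote(scores: List[float]) -> float:
    """Combine regression outputs by simple averaging."""
    return sum(scores) / len(scores)


winner = majority_vote(["dog", "cat", "dog"])  # most frequent label wins
blended = average_vote([0.2, 0.4, 0.6])        # mean of the regressors
```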
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceService"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cat-dog-classifier"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key 
atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">pytorch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//kfserving</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">examples/models/torchserve/cat_dog_classification</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token 
key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceService"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"dog-breed-classifier"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">pytorch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//kfserving</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">examples/models/torchserve/dog_breed_classification</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1alpha1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key 
atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceGraph"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"dog-breed-pipeline"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">nodes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">root</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Sequence</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span 
class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cat</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">dog</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">classifier</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cat_dog_classifier </span><span class="token comment" style="color:#999988;font-style:italic"># step name</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dog</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">breed</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">classifier</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dog_breed_classifier</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> $request</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">condition</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[@this].#(predictions.0==\"dog\")"</span><br></span></code></pre></div></div>
<p>Currently, <code>InferenceGraph</code> is supported in the <code>Serverless</code> deployment mode. You can try it out by following the <a href="https://kserve.github.io/archive/0.9/modelserving/inference_graph/image_pipeline/" target="_blank" rel="noopener noreferrer" class="">tutorial</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-inferenceservice-api-for-modelmesh">🔗 InferenceService API for ModelMesh<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-inferenceservice-api-for-modelmesh" class="hash-link" aria-label="Direct link to 🔗 InferenceService API for ModelMesh" title="Direct link to 🔗 InferenceService API for ModelMesh" translate="no">​</a></h2>
<p>The InferenceService CRD is now the primary interface for interacting with ModelMesh. Some changes were made to the InferenceService spec to better facilitate ModelMesh's needs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-storage-spec">💾 Storage Spec<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-storage-spec" class="hash-link" aria-label="Direct link to 💾 Storage Spec" title="Direct link to 💾 Storage Spec" translate="no">​</a></h3>
<p>To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret holding config/credentials for each of the storage backends from which models can be loaded. Example:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">storage</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> localMinIO </span><span class="token comment" style="color:#999988;font-style:italic"># Credential key for the destination storage in the common secret</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn </span><span class="token comment" style="color:#999988;font-style:italic"># Model path inside the bucket</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># schemaPath: null # Optional schema files for payload schema</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parameters</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic"># Parameters to override the default 
values inside the common secret.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">bucket</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">models</span><br></span></code></pre></div></div>
<p>Learn more <a href="https://github.com/kserve/kserve/tree/release-0.9/docs/samples/storage/storageSpec" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-model-status">📊 Model Status<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-model-status" class="hash-link" aria-label="Direct link to 📊 Model Status" title="Direct link to 📊 Model Status" translate="no">​</a></h3>
<p>For further alignment between ModelMesh and KServe, some additions to the InferenceService status were made. There is now a <code>Model Status</code> section which contains information about the model loaded in the predictor. New fields include:</p>
<ul>
<li class=""><code>states</code> - State information of the predictor's model.</li>
<li class=""><code>activeModelState</code> - The state of the model currently being served by the predictor's endpoints.</li>
<li class=""><code>targetModelState</code> - This will be set only when <code>transitionStatus</code> is not <code>UpToDate</code>, meaning that the target model differs from the currently-active model.</li>
<li class=""><code>transitionStatus</code> - Indicates state of the predictor relative to its current spec.</li>
<li class=""><code>modelCopies</code> - Model copy information of the predictor's model.</li>
<li class=""><code>lastFailureInfo</code> - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value.</li>
</ul>
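<p>Taken together, these fields surface in the InferenceService status roughly as follows (an illustrative sketch of the shape; the exact nesting and values are examples, not output copied from a live cluster):</p>

```yaml
status:
  modelStatus:
    transitionStatus: UpToDate     # predictor matches its current spec
    states:
      activeModelState: Loaded     # state of the model currently being served
      targetModelState: ""         # set only when transitionStatus is not UpToDate
    modelCopies:
      totalCopies: 1
      failedCopies: 0
```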
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-deploying-on-modelmesh">🚢 Deploying on ModelMesh<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-deploying-on-modelmesh" class="hash-link" aria-label="Direct link to 🚢 Deploying on ModelMesh" title="Direct link to 🚢 Deploying on ModelMesh" translate="no">​</a></h3>
<p>To deploy an InferenceService on ModelMesh, the ModelMesh and KServe controllers still require the user to specify the <code>serving.kserve.io/deploymentMode: ModelMesh</code> annotation.
A complete example of an InferenceService with the new storage spec is shown below:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">tensorflow</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">mnist</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">serving.kserve.io/deploymentMode</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ModelMesh</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tensorflow</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storage</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 
localMinIO</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tensorflow/mnist.savedmodel</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-other-new-features">🛠️ Other New Features<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#%EF%B8%8F-other-new-features" class="hash-link" aria-label="Direct link to 🛠️ Other New Features" title="Direct link to 🛠️ Other New Features" translate="no">​</a></h2>
<ul>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/modelserving/v1beta1/mlflow/v2/" target="_blank" rel="noopener noreferrer" class="">serving MLFlow model format</a> via MLServer serving runtime.</li>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/modelserving/autoscaling/autoscaling/" target="_blank" rel="noopener noreferrer" class="">unified autoscaling target and metric fields</a> for InferenceService components with both Serverless and RawDeployment mode.</li>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/admin/kubernetes_deployment/" target="_blank" rel="noopener noreferrer" class="">InferenceService ingress class and url domain template configuration</a> for RawDeployment mode.</li>
<li class="">ModelMesh now has a default <a href="https://github.com/openvinotoolkit/model_server" target="_blank" rel="noopener noreferrer" class="">OpenVINO Model Server</a> ServingRuntime.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h2>
<ul>
<li class="">The KServe controller manager is changed from StatefulSet to Deployment to support HA mode.</li>
<li class="">log4j security vulnerability fix</li>
<li class="">Upgrade TorchServe serving runtime to 0.6.0</li>
<li class="">Update MLServer serving runtime to 1.0.0</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes, including all changes, bug fixes, and known issues, visit the GitHub release pages for <a href="https://github.com/kserve/kserve/releases/tag/v0.9.0" target="_blank" rel="noopener noreferrer" class="">KServe</a> and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.9.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and working group members</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>LF AI &amp; Data Foundation</strong>: For supporting KServe's journey as an incubation project</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[Announcing KServe v0.8]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release</guid>
            <pubDate>Fri, 18 Feb 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.8 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on February 18, 2022</em></p>
<p>Today, we are pleased to announce the v0.8.0 release of KServe! While the last release was focused on the <a href="https://blog.kubeflow.org/release/official/2021/09/27/kfserving-transition.html" target="_blank" rel="noopener noreferrer" class="">transition</a> of KFServing to KServe, this release was focused on unifying the InferenceService API for deploying models on KServe and ModelMesh.</p>
<blockquote>
<p><strong>Note</strong>: For current users of KFServing/KServe, please take a few minutes to answer this <a href="https://groups.google.com/g/kubeflow-discuss/c/B0trz3qZiJE" target="_blank" rel="noopener noreferrer" class="">short survey</a> and provide your feedback!</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class=""><strong>ONNX Runtime Server</strong> has been removed from the supported serving runtime list. KServe by default now uses the <strong>Triton Inference Server</strong> to serve ONNX models.</li>
<li class="">KServe's <strong>PyTorchServer</strong> has been removed from the supported serving runtime list. KServe by default now uses <strong>TorchServe</strong> to serve PyTorch models.</li>
<li class="">A few main KServe SDK class names have been changed:<!-- -->
<ul>
<li class=""><strong>KFModel</strong> is renamed to <strong>Model</strong></li>
<li class=""><strong>KFServer</strong> is renamed to <strong>ModelServer</strong></li>
<li class=""><strong>KFModelRepository</strong> is renamed to <strong>ModelRepository</strong></li>
</ul>
</li>
</ul>
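<p>The renames above amount to a straightforward import update (e.g. <code>from kserve import Model, ModelServer</code>). As a quick reference, the mapping can be captured in a small lookup table (an illustrative migration aid written for this post, not part of the KServe SDK):</p>

```python
# The v0.8 SDK class renames, expressed as a lookup table.
# This helper is illustrative only; the actual migration is
# simply updating your imports to the new names.
KSERVE_V08_RENAMES = {
    "KFModel": "Model",
    "KFServer": "ModelServer",
    "KFModelRepository": "ModelRepository",
}

def migrate_symbol(old_name: str) -> str:
    """Return the v0.8 name for a pre-0.8 KServe SDK class name."""
    return KSERVE_V08_RENAMES.get(old_name, old_name)
```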
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">🚀 What's New<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-whats-new" class="hash-link" aria-label="Direct link to 🚀 What's New" title="Direct link to 🚀 What's New" translate="no">​</a></h2>
<p>Some notable updates are:</p>
<ul>
<li class=""><strong>ClusterServingRuntime</strong> and <strong>ServingRuntime</strong> CRDs are introduced. Learn more <a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-servingruntimes-and-clusterservingruntimes" class="">below</a>.</li>
<li class="">A new <strong>Model Spec</strong> was introduced to the InferenceService Predictor Spec as a new way to specify models. Learn more <a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-updated-inferenceservice-predictor-spec" class="">below</a>.</li>
<li class=""><strong>Knative 1.0</strong> is now supported and certified for the KServe Serverless installation.</li>
<li class=""><strong>gRPC</strong> is now supported for transformer to predictor network communication.</li>
<li class=""><strong>TorchServe</strong> Serving runtime has been updated to 0.5.2 which now supports the KServe V2 REST protocol.</li>
<li class=""><strong>ModelMesh</strong> now has multi-namespace support, and users can now deploy GCS or HTTP(S) hosted models.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-servingruntimes-and-clusterservingruntimes">🔧 ServingRuntimes and ClusterServingRuntimes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-servingruntimes-and-clusterservingruntimes" class="hash-link" aria-label="Direct link to 🔧 ServingRuntimes and ClusterServingRuntimes" title="Direct link to 🔧 ServingRuntimes and ClusterServingRuntimes" translate="no">​</a></h2>
<p>This release introduces two new CRDs, <em>ServingRuntimes</em> and <em>ClusterServingRuntimes</em>; the only difference between the two is that ServingRuntimes are namespace-scoped while ClusterServingRuntimes are cluster-scoped. A ServingRuntime defines the templates for Pods that can serve one or more particular model formats. Each ServingRuntime defines key information such as the container image of the runtime and a list of the model formats that the runtime supports.</p>
<p>In previous versions of KServe, supported predictor formats and container images were defined in a <a href="https://github.com/kserve/kserve/blob/release-0.7/config/configmap/inferenceservice.yaml#L7" target="_blank" rel="noopener noreferrer" class="">config map</a> in the control plane namespace. The ServingRuntime CRD allows for improved flexibility and extensibility in defining or customizing runtimes as you see fit, without having to modify any controller code or any resources in the controller namespace.</p>
<p>Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can continue to use KServe as they did before, without having to define the runtimes themselves.</p>
<p><strong>Example SKLearn ClusterServingRuntime:</strong></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ClusterServingRuntime</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearnserver</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">supportedModelFormats</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">version</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">autoSelect</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">container</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> kserve/sklearnserver</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">.Name</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_dir=/mnt/models</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">http_port=8080</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-updated-inferenceservice-predictor-spec">📋 Updated InferenceService Predictor Spec<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-updated-inferenceservice-predictor-spec" class="hash-link" aria-label="Direct link to 📋 Updated InferenceService Predictor Spec" title="Direct link to 📋 Updated InferenceService Predictor Spec" translate="no">​</a></h2>
<p>A new Model spec was also introduced as part of the Predictor spec for InferenceServices. The InferenceService CRD had become unwieldy because each model serving runtime was its own object in the Predictor spec. This duplicated many fields in the schema and bloated the overall size of the CRD, and supporting a new model serving framework meant modifying the CRD and then the controller code.</p>
<p>Now, with the Model spec, a user specifies a model format and, optionally, a corresponding version. The KServe control plane then automatically selects and uses a <em>ClusterServingRuntime</em> or <em>ServingRuntime</em> that supports the given format. Each <em>ServingRuntime</em> maintains a list of supported model formats and versions; if a format has <code>autoSelect</code> set to <code>true</code>, that <em>ServingRuntime</em> becomes a candidate for automatic selection for that model format.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">New Schema</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Previous Schema</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearn</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">isvc</span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> s3</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//bucket/sklearn/mnist.joblib</span><br></span></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V 
thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearn</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">isvc</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">sklearn</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> s3</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//bucket/sklearn/mnist.joblib</span><br></span></code></pre></div></div></div></div></div>
<p>The previous way of defining predictors is still supported; however, the new approach is preferred going forward. Eventually, the previous schema, with framework names as keys in the predictor spec, will be removed.</p>
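<p>For reference, automatic runtime selection is driven by the runtime's list of supported model formats. The following is a sketch of a <em>ClusterServingRuntime</em> (the name and image are illustrative) that opts into auto-selection for the sklearn format:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-sklearnserver
spec:
  supportedModelFormats:
    # autoSelect: true makes this runtime a candidate whenever an
    # InferenceService requests modelFormat sklearn without naming a runtime.
    - name: sklearn
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:latest
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
        - --http_port=8080
</code></pre>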
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">🌐 ModelMesh Updates<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 🌐 ModelMesh Updates" title="Direct link to 🌐 ModelMesh Updates" translate="no">​</a></h2>
<p><a href="https://developer.ibm.com/blogs/kserve-and-watson-modelmesh-extreme-scale-model-inferencing-for-trusted-ai/" target="_blank" rel="noopener noreferrer" class="">ModelMesh</a> is being integrated as KServe's multi-model serving backend. With the inclusion of the aforementioned ServingRuntime CRDs and the Predictor Model spec, the two projects are now much more closely aligned, with continual improvements underway.</p>
<p>ModelMesh now supports multi-namespace reconciliation. Previously, the ModelMesh controller reconciled only resources deployed in the same namespace as the controller. Now, by default, ModelMesh handles InferenceService deployments in any "modelmesh-enabled" namespace. Learn more <a href="https://github.com/kserve/modelmesh-serving/blob/release-0.8/docs/install/install-script.md#setup-additional-namespaces" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
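<p>Assuming the namespace label described in the install documentation linked above, opting a namespace into ModelMesh reconciliation is a matter of labeling it (the namespace name here is illustrative):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Namespace
metadata:
  name: my-models
  labels:
    # Marks this namespace for reconciliation by the ModelMesh controller
    modelmesh-enabled: "true"
</code></pre>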
<p>Also, while ModelMesh previously supported only S3-based storage, we are happy to share that it now works with models hosted on GCS or served over HTTP(S).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>To see all release updates, check out the KServe <a href="https://github.com/kserve/kserve/releases/tag/v0.8.0" target="_blank" rel="noopener noreferrer" class="">release notes</a> and ModelMesh Serving <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.8.0" target="_blank" rel="noopener noreferrer" class="">release notes</a>!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Authors</strong>: Dan Sun, Paul Van Eck, Vedant Padwal, Andrews Arokiam on behalf of the KServe Working Group</li>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and working group members</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">#kubeflow-kfserving</a>)</li>
<li class="">Attend a <a href="https://docs.google.com/document/d/1KZUURwr9MnHXqHA08TFbfVbM8EAJSJjmaMhnvstvi-k/edit#heading=h.4i9fb8ndp9vp" target="_blank" rel="noopener noreferrer" class="">biweekly community meeting on Wednesday 9am PST</a></li>
<li class="">View our <a href="https://github.com/kserve/website/blob/v0.8/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.8/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc</a> contribution guides to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[Announcing KServe v0.7 - Smooth Transition from KFServing to KServe]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release</guid>
            <pubDate>Mon, 11 Oct 2021 00:00:00 GMT</pubDate>
            <description><![CDATA[KServe 0.7 Release Blog Post]]></description>
            <content:encoded><![CDATA[<p><em>Published on October 11, 2021</em></p>
<p><a class="" href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition">KFServing is now KServe</a>, and the KServe 0.7 release is now available. This release also ensures a smooth migration experience for users moving from KFServing to KServe.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class=""><code>InferenceService</code> API group is changed from <code>serving.kubeflow.org</code> to <code>serving.kserve.io</code> <a href="https://github.com/kserve/kserve/issues/1826" target="_blank" rel="noopener noreferrer" class="">#1826</a>, <a href="https://kserve.github.io/archive/0.7/admin/migration/" target="_blank" rel="noopener noreferrer" class="">the migration job</a> is created for smooth transition.</li>
<li class="">Python SDK name is changed from <a href="https://pypi.org/project/kfserving" target="_blank" rel="noopener noreferrer" class="">kfserving</a> to <a href="https://pypi.org/project/kserve" target="_blank" rel="noopener noreferrer" class="">kserve</a>.</li>
<li class="">KServe Installation manifests <a href="https://github.com/kserve/kserve/issues/1824" target="_blank" rel="noopener noreferrer" class="">#1824</a>.</li>
<li class="">The models web app has been split out of the kserve repository into its own <a href="https://github.com/kserve/models-web-app" target="_blank" rel="noopener noreferrer" class="">models-web-app</a> repository.</li>
<li class="">Docs and examples have moved to the separate <a href="https://github.com/kserve/website" target="_blank" rel="noopener noreferrer" class="">website</a> repository.</li>
<li class="">KServe images have migrated to the kserve Docker Hub account.</li>
<li class="">v1alpha2 API group is deprecated <a href="https://github.com/kserve/kserve/issues/1850" target="_blank" rel="noopener noreferrer" class="">#1850</a>.</li>
</ul>
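<p>In practice, the API group change above means existing InferenceService manifests typically only need their <code>apiVersion</code> updated:</p>
<pre><code class="language-yaml"># Before (KFServing)
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
# After (KServe)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
</code></pre>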
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">🚀 What's New<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-whats-new" class="hash-link" aria-label="Direct link to 🚀 What's New" title="Direct link to 🚀 What's New" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>ModelMesh project is joining KServe</strong> under repository <a href="https://github.com/kserve/modelmesh-serving" target="_blank" rel="noopener noreferrer" class="">modelmesh-serving</a>!</p>
<p>ModelMesh is designed for high-scale, high-density, and frequently-changing model use cases. It intelligently loads and unloads AI models to and from memory to strike a balance between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the <a href="https://developer.ibm.com/blogs/kserve-and-watson-modelmesh-extreme-scale-model-inferencing-for-trusted-ai" target="_blank" rel="noopener noreferrer" class="">ModelMesh announcement blog</a> and <a href="https://www.linkedin.com/feed/update/urn:li:activity:6854064203360280576/" target="_blank" rel="noopener noreferrer" class="">join the talk at KubeCon NA for a deeper dive into ModelMesh and KServe</a>.</p>
</li>
<li class="">
<p><strong>(Alpha feature)</strong> Raw Kubernetes deployment support: the Istio/Knative dependency is now optional. Please follow the <a href="https://kserve.github.io/archive/0.7/admin/kubernetes_deployment" target="_blank" rel="noopener noreferrer" class="">guide</a> to install and enable <code>RawDeployment</code> mode.</p>
</li>
<li class="">
<p>KServe now has its own documentation <a href="https://kserve.github.io/website" target="_blank" rel="noopener noreferrer" class="">website</a>, temporarily hosted on GitHub Pages.</p>
</li>
<li class="">
<p>Support for v1 CRD and webhook configurations on Kubernetes 1.22 <a href="https://github.com/kserve/kserve/issues/1837" target="_blank" rel="noopener noreferrer" class="">#1837</a>.</p>
</li>
<li class="">
<p>The Triton model serving runtime now defaults to version 21.09 <a href="https://github.com/kserve/kserve/issues/1840" target="_blank" rel="noopener noreferrer" class="">#1840</a>.</p>
</li>
</ul>
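<p>As a sketch of the new <code>RawDeployment</code> mode mentioned above (annotation per the linked guide; the model URI is illustrative), an InferenceService can opt into raw Kubernetes deployments like this:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-raw
  annotations:
    # Deploys plain Deployment/Service/HPA objects instead of Knative resources
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    sklearn:
      storageUri: s3://bucket/sklearn/mnist.joblib
</code></pre>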
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-fixed">🔧 What's Fixed<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-whats-fixed" class="hash-link" aria-label="Direct link to 🔧 What's Fixed" title="Direct link to 🔧 What's Fixed" translate="no">​</a></h2>
<ul>
<li class="">Bug fix for Azure blob storage <a href="https://github.com/kserve/kserve/issues/1845" target="_blank" rel="noopener noreferrer" class="">#1845</a>.</li>
<li class="">Tar/Zip support for all storage options <a href="https://github.com/kserve/kserve/issues/1836" target="_blank" rel="noopener noreferrer" class="">#1836</a>.</li>
<li class="">Fix AWS_REGION env variable and add AWS_CA_BUNDLE for S3 <a href="https://github.com/kserve/kserve/issues/1780" target="_blank" rel="noopener noreferrer" class="">#1780</a>.</li>
<li class="">TorchServe custom package installation fix <a href="https://github.com/kserve/kserve/issues/1619" target="_blank" rel="noopener noreferrer" class="">#1619</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.7.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/animeshsingh" target="_blank" rel="noopener noreferrer" class="">Animesh Singh</a></li>
<li class=""><a href="https://github.com/chinhuang007" target="_blank" rel="noopener noreferrer" class="">Chin Huang</a></li>
<li class=""><a href="http://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh</a></li>
<li class=""><a href="https://github.com/jinchihe" target="_blank" rel="noopener noreferrer" class="">Jinchi He</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
<li class=""><a href="https://github.com/pvaneck" target="_blank" rel="noopener noreferrer" class="">Paul Van Eck</a></li>
<li class=""><a href="https://github.com/Iamlovingit" target="_blank" rel="noopener noreferrer" class="">Qianshan Chen</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkiran</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/theofpa" target="_blank" rel="noopener noreferrer" class="">Theofilos Papapanagiotou</a></li>
<li class=""><a href="https://github.com/Tomcli" target="_blank" rel="noopener noreferrer" class="">Tommy Li</a></li>
<li class=""><a href="https://github.com/js-ts" target="_blank" rel="noopener noreferrer" class="">Vedant Padwal</a></li>
<li class=""><a href="https://github.com/PatrickXYS" target="_blank" rel="noopener noreferrer" class="">Yao Xiao</a></li>
<li class=""><a href="https://github.com/yuzliu" target="_blank" rel="noopener noreferrer" class="">Yuzhui Liu</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and working group members</p>
<p><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features during this important transition</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the <a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
<li class="">Attend a <a href="https://docs.google.com/document/d/1KZUURwr9MnHXqHA08TFbfVbM8EAJSJjmaMhnvstvi-k/edit#heading=h.4i9fb8ndp9vp" target="_blank" rel="noopener noreferrer" class="">Biweekly community meeting on Wednesday 9am PST</a></li>
<li class="">Contribute at <a href="https://github.com/kserve/website/blob/main/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/main/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community during this important transition!</em></p>]]></content:encoded>
            <category>Releases</category>
        </item>
        <item>
            <title><![CDATA[KServe: The next generation of KFServing]]></title>
            <link>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition</link>
            <guid>https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition</guid>
            <pubDate>Mon, 27 Sep 2021 00:00:00 GMT</pubDate>
            <description><![CDATA[Announcing the transition from KFServing to KServe]]></description>
            <content:encoded><![CDATA[<p><em>Published on September 27, 2021</em></p>
<p>We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the <a href="https://github.com/kubeflow/kfserving" target="_blank" rel="noopener noreferrer" class="">KFServing GitHub repository</a> has now been transferred to an independent <a href="https://github.com/kserve/kserve" target="_blank" rel="noopener noreferrer" class="">KServe GitHub organization</a> under the stewardship of the Kubeflow Serving Working Group leads.</p>
<p>The project has been rebranded from <strong>KFServing</strong> to <strong>KServe</strong>, and we are planning to graduate the project from the Kubeflow Project later this year.</p>
<p><img decoding="async" loading="lazy" alt="KFServing to KServe Transition" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/image1-88ae02ce8957a75ad191a74d1a743bfb.png" width="1256" height="730" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-project-background">🎯 Project Background<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-project-background" class="hash-link" aria-label="Direct link to 🎯 Project Background" title="Direct link to 🎯 Project Background" translate="no">​</a></h2>
<p>Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon, KFServing was published as open source in early 2019. The project set out to provide the following features:</p>
<ul>
<li class="">A simple, yet powerful, Kubernetes Custom Resource for deploying machine learning (ML) models on production across ML frameworks.</li>
<li class="">A performant, standardized inference protocol.</li>
<li class="">Serverless inference that scales according to live traffic patterns, supporting scale-to-zero on both CPUs and GPUs.</li>
<li class="">A complete story for production ML model serving, including prediction, pre/post-processing, explainability, and monitoring.</li>
<li class="">Support for deploying thousands of models at scale, plus inference graph capability for multiple models.</li>
</ul>
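<p>To illustrate the first point, a minimal KFServing-era InferenceService captures a full model deployment in a few lines (the bucket path here is illustrative):</p>
<pre><code class="language-yaml">apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      # Model artifacts are pulled from this URI at startup
      storageUri: s3://bucket/sklearn/iris
</code></pre>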
<p>KFServing was created to address the challenges of deploying and monitoring machine learning models in production. After the project was open-sourced, we saw an explosion in demand for the software, leading to strong adoption and community growth. The project's scope has since grown, and we have developed multiple components along the way, including a growing body of documentation that needs its own website and an independent GitHub organization.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-next">🚀 What's Next<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-whats-next" class="hash-link" aria-label="Direct link to 🚀 What's Next" title="Direct link to 🚀 What's Next" translate="no">​</a></h2>
<p>Over the coming weeks, we will be releasing <strong>KServe 0.7</strong> outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruption. KFServing 0.5.x/0.6.x releases will continue to be supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for <a href="https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc" target="_blank" rel="noopener noreferrer" class="">the conformance program</a>.</p>
<p>For contributors, please follow the KServe <a href="https://github.com/kserve/website/blob/v0.7/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.7/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!</p>
<p><img decoding="async" loading="lazy" alt="KServe Logo" src="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-b9befb7647f020cdab9eb81b3f627404.png" width="3322" height="1677" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-key-links">🔗 KServe Key Links<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-kserve-key-links" class="hash-link" aria-label="Direct link to 🔗 KServe Key Links" title="Direct link to 🔗 KServe Key Links" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a></li>
<li class=""><a href="https://github.com/kserve/kserve/" target="_blank" rel="noopener noreferrer" class="">Github</a></li>
<li class=""><a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-contributor-acknowledgement">🙏 Contributor Acknowledgement<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-contributor-acknowledgement" class="hash-link" aria-label="Direct link to 🙏 Contributor Acknowledgement" title="Direct link to 🙏 Contributor Acknowledgement" translate="no">​</a></h2>
<p>We'd like to thank all the KServe contributors for this transition work!</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/animeshsingh" target="_blank" rel="noopener noreferrer" class="">Animesh Singh</a></li>
<li class=""><a href="https://github.com/chinhuang007" target="_blank" rel="noopener noreferrer" class="">Chin Huang</a></li>
<li class=""><a href="http://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh</a></li>
<li class=""><a href="https://github.com/jinchihe" target="_blank" rel="noopener noreferrer" class="">Jinchi He</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
<li class=""><a href="https://github.com/pvaneck" target="_blank" rel="noopener noreferrer" class="">Paul Van Eck</a></li>
<li class=""><a href="https://github.com/Iamlovingit" target="_blank" rel="noopener noreferrer" class="">Qianshan Chen</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkiran</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/theofpa" target="_blank" rel="noopener noreferrer" class="">Theofilos Papapanagiotou</a></li>
<li class=""><a href="https://github.com/Tomcli" target="_blank" rel="noopener noreferrer" class="">Tommy Li</a></li>
<li class=""><a href="https://github.com/js-ts" target="_blank" rel="noopener noreferrer" class="">Vedant Padwal</a></li>
<li class=""><a href="https://github.com/PatrickXYS" target="_blank" rel="noopener noreferrer" class="">Yao Xiao</a></li>
<li class=""><a href="https://github.com/yuzliu" target="_blank" rel="noopener noreferrer" class="">Yuzhui Liu</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and Kubeflow Serving Working Group leads</p>
<p><strong>Community</strong>: Everyone who supported this important transition and helped establish KServe as an independent project</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-643--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve/kserve/" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the <a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
<li class="">Follow the KServe <a href="https://github.com/kserve/website/blob/v0.7/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.7/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guides to make contributions</li>
</ul>
<p><strong>Welcome to KServe!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of this exciting transition!</em></p>]]></content:encoded>
            <category>Announcements</category>
        </item>
    </channel>
</rss>