Manages a group of API server processes.
Handles creation, monitoring, and termination of API server worker processes. It also monitors any additional processes to ensure they remain healthy.
Source code in vllm/v1/utils.py
  
 __init__(
    target_server_fn: Callable,
    listen_address: str,
    sock: Any,
    args: Namespace,
    num_servers: int,
    input_addresses: list[str],
    output_addresses: list[str],
    stats_update_address: str | None = None,
)
Initialize and start API server worker processes.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| target_server_fn | Callable | Function to call for each API server process | required | 
| listen_address | str | Address to listen for client connections | required | 
| sock | Any | Socket for client connections | required | 
| args | Namespace | Command line arguments | required | 
| num_servers | int | Number of API server processes to start | required | 
| input_addresses | list[str] | Input addresses for each API server | required | 
| output_addresses | list[str] | Output addresses for each API server | required | 
| stats_update_address | str \| None | Optional stats update address | None |
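A minimal construction sketch, assuming `APIServerProcessManager` is importable from `vllm.v1.utils`; the worker entry point and the ZMQ addresses below are illustrative placeholders, not the exact values vLLM uses:

```python
import argparse
import socket

from vllm.v1.utils import APIServerProcessManager

def run_server(listen_address, sock, args, **kwargs):
    # Hypothetical worker entry point; the exact signature expected by
    # target_server_fn is not documented in this section.
    ...

args = argparse.Namespace()  # in practice, the parsed CLI arguments
sock = socket.socket()       # in practice, the already-bound listen socket

manager = APIServerProcessManager(
    target_server_fn=run_server,
    listen_address="0.0.0.0:8000",
    sock=sock,
    args=args,
    num_servers=2,
    # One input/output ZMQ address per API server process (illustrative).
    input_addresses=["ipc:///tmp/in-0.ipc", "ipc:///tmp/in-1.ipc"],
    output_addresses=["ipc:///tmp/out-0.ipc", "ipc:///tmp/out-1.ipc"],
    stats_update_address=None,
)
```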
 Buffer to easily copy tensors between CPU and GPU.
 __init__(
    *size: int | SymInt,
    dtype: dtype,
    device: device,
    pin_memory: bool,
    with_numpy: bool = True,
) -> None
  NOTE: Because this method is non-blocking, explicit synchronization is needed to ensure the data is copied to CPU.
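A usage sketch for this buffer, assuming it is the `CpuGpuBuffer` class in `vllm.v1.utils` and that it exposes a pinned `cpu` tensor plus `copy_to_gpu`/`copy_to_cpu` methods; those attribute and method names are assumptions, as is CUDA availability:

```python
import torch

from vllm.v1.utils import CpuGpuBuffer  # assumed class name for this buffer

# Pinned CPU + GPU tensor pair of shape (8, 128), assuming CUDA is available.
buf = CpuGpuBuffer(
    8, 128,
    dtype=torch.float32,
    device=torch.device("cuda:0"),
    pin_memory=True,
)

buf.cpu.fill_(1.0)        # pinned host tensor (assumed attribute name)
buf.copy_to_gpu()         # host -> device copy (assumed method name)
out = buf.copy_to_cpu(4)  # device -> host for the first 4 rows, non-blocking

# Because the copy above is non-blocking, synchronize before reading `out`.
torch.cuda.synchronize()
print(out.shape)
```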
Copy the first `length` elements of a tensor into another tensor in a non-blocking manner.
Used to copy pinned CPU tensor data to pre-allocated GPU tensors.
Returns the sliced target tensor.
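The helper's exact name and parameter order are not shown in this section, so the sketch below is a hypothetical re-implementation of the described behavior rather than vLLM's own function:

```python
import torch

def copy_slice(src: torch.Tensor, dst: torch.Tensor, length: int) -> torch.Tensor:
    # Hypothetical re-implementation of the behavior described above:
    # copy the first `length` elements asynchronously and return the slice.
    dst[:length].copy_(src[:length], non_blocking=True)
    return dst[:length]

src = torch.ones(16, dtype=torch.float32, pin_memory=True)  # pinned CPU data
dst = torch.zeros(16, dtype=torch.float32, device="cuda")   # pre-allocated GPU tensor
view = copy_slice(src, dst, 8)
torch.cuda.synchronize()  # the copy is non-blocking, so synchronize before use
```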
Assign a new ZMQ socket address.
If `local_only` is True, the participants are colocated, so a unique IPC address is returned.
Otherwise, the provided host and port are used to construct a TCP address (`port == 0` means an available port will be assigned).
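The function's name is not shown in this section; the sketch below is a hypothetical helper with the same semantics (a unique IPC address when colocated, a TCP address otherwise, with port 0 meaning "pick a free port"):

```python
import socket
import tempfile
import uuid

def assign_zmq_addr(local_only: bool, host: str, port: int = 0) -> str:
    # Hypothetical helper mirroring the documented behavior.
    if local_only:
        # Colocated participants: hand out a unique IPC endpoint.
        return f"ipc://{tempfile.gettempdir()}/{uuid.uuid4()}.ipc"
    if port == 0:
        # Ask the OS for an available port.
        with socket.socket() as s:
            s.bind((host, 0))
            port = s.getsockname()[1]
    return f"tcp://{host}:{port}"

print(assign_zmq_addr(local_only=True, host="127.0.0.1"))   # e.g. ipc:///tmp/<uuid>.ipc
print(assign_zmq_addr(local_only=False, host="127.0.0.1"))  # e.g. tcp://127.0.0.1:54321
```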
 record_function_or_nullcontext(
    name: str,
) -> AbstractContextManager
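A brief usage sketch; the wrapped function below is a placeholder, and the assumption is that the returned context manager records a named profiling range when profiling is active and otherwise behaves as a null context:

```python
from vllm.v1.utils import record_function_or_nullcontext

def run_forward_pass():
    # Placeholder for the real work being profiled.
    pass

# The block appears as a named range in profiler traces when profiling is
# enabled; otherwise the context is expected to do nothing.
with record_function_or_nullcontext("model_forward"):
    run_forward_pass()
```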
 report_usage_stats(
    vllm_config,
    usage_context: UsageContext = ENGINE_CONTEXT,
) -> None
Report usage statistics if enabled.
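A minimal call sketch, assuming `UsageContext` comes from `vllm.usage.usage_lib` and that a config object has already been built (here via `EngineArgs.create_engine_config()`, shown for illustration only); reporting only happens when usage stats collection is enabled:

```python
from vllm.engine.arg_utils import EngineArgs
from vllm.usage.usage_lib import UsageContext
from vllm.v1.utils import report_usage_stats

# Assumed way to obtain an engine config for illustration purposes.
vllm_config = EngineArgs(model="facebook/opt-125m").create_engine_config()

report_usage_stats(vllm_config, usage_context=UsageContext.ENGINE_CONTEXT)
```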
 shutdown(procs: list[BaseProcess])
 wait_for_completion_or_failure(
    api_server_manager: APIServerProcessManager,
    engine_manager: Union[
        CoreEngineProcManager, CoreEngineActorManager
    ]
    | None = None,
    coordinator: Optional[DPCoordinator] = None,
) -> None
Wait for all processes to complete or detect if any fail.
Raises an exception if any process exits with a non-zero status.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| api_server_manager | APIServerProcessManager | The manager for API servers. | required | 
| engine_manager | Union[CoreEngineProcManager, CoreEngineActorManager] \| None | The manager for engine processes. If CoreEngineProcManager, it manages local engines; if CoreEngineActorManager, it manages all engines. | None |
| coordinator | Optional[DPCoordinator] | The coordinator for data parallel. | None |
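A supervision sketch reusing the `manager` created in the first sketch above; the `processes` attribute used for cleanup is an assumption about `APIServerProcessManager`, not something documented in this section:

```python
from vllm.v1.utils import shutdown, wait_for_completion_or_failure

try:
    # Block until every managed process exits; an exception is raised if any
    # process terminates with a non-zero status.
    wait_for_completion_or_failure(api_server_manager=manager)
except Exception as exc:
    print(f"A managed process failed: {exc}")
finally:
    # Assumed attribute: the manager's list of worker BaseProcess objects.
    shutdown(manager.processes)
```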