Install IPEX-LLM on Windows with Intel GPU


This guide demonstrates how to install IPEX-LLM on Windows with Intel GPUs.

It applies to Intel Core Ultra and 11th-14th Gen Core integrated GPUs (iGPUs), as well as Intel Arc Series GPUs.


Install Prerequisites

(Optional) Update GPU Driver

Tip

It is recommended to update your GPU driver if your current driver version is lower than 31.0.101.5122. Refer to here for more information.
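
If you are not sure which driver version is currently installed, you can query it without opening a GUI. The following is a minimal sketch that shells out to PowerShell's Get-CimInstance cmdlet (available by default on Windows); only the Intel entry in the output is relevant here.

# Minimal sketch: list installed GPU driver versions via PowerShell.
# Assumes a default Windows setup where powershell.exe is on PATH.
import subprocess

result = subprocess.run(
    ["powershell", "-NoProfile", "-Command",
     "Get-CimInstance Win32_VideoController | Select-Object Name, DriverVersion"],
    capture_output=True, text=True, check=True)
print(result.stdout)  # compare the Intel entry against 31.0.101.5122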

Download and install the latest GPU driver from the official Intel download page. A system reboot is necessary to apply the changes after the installation is complete.

Note

The process could take around 10 minutes. After reboot, check the Intel Arc Control application to verify that the driver has been installed correctly. If the installation was successful, you should see an Arc Control interface similar to the figure below.

Setup Python Environment

Visit the Miniforge installation page, download the Miniforge installer for Windows, and follow the instructions to complete the installation.

After installation, open the Miniforge Prompt and create a new Python environment named llm:

conda create -n llm python=3.11 libuv

Activate the newly created environment llm:

conda activate llm

Install ipex-llm

With the llm environment active, use pip to install ipex-llm for GPU. Choose either the US or the CN website for extra-index-url:

  • For US:

    pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
  • For CN:

    pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/

Note

If you encounter network issues while installing IPEX, refer to this guide for troubleshooting advice.

Verify Installation

You can verify whether ipex-llm has been installed successfully by following the steps below.

Step 1: Runtime Configurations

  • Open the Miniforge Prompt and activate the Python environment llm you previously created:

    conda activate llm
  • Set the following environment variables according to your device:

    • For Intel iGPU:

      set SYCL_CACHE_PERSISTENT=1
      set BIGDL_LLM_XMX_DISABLED=1
    • For Intel Arc™ A770:

      set SYCL_CACHE_PERSISTENT=1

Tip

For other Intel dGPU Series, please refer to this guide for more details regarding runtime configuration.
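
If you prefer to keep these runtime settings next to your code rather than typing set in the shell, they can also be applied from Python itself. This is a minimal sketch; it assumes the variables are set before torch or ipex_llm is imported, so that the runtime picks them up.

# Minimal sketch: set the runtime variables in-process instead of using `set`.
# Assumption: this runs before `import torch` / `import ipex_llm`.
import os

os.environ["SYCL_CACHE_PERSISTENT"] = "1"
os.environ["BIGDL_LLM_XMX_DISABLED"] = "1"  # only needed on Intel iGPU, per the list above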

Step 2: Run Python Code

  • Launch the Python interactive shell by typing python in the Miniforge Prompt window and then pressing Enter.
  • Copy the following code into the Miniforge Prompt line by line, pressing Enter after each line.

    import torch
    from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
    tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
    tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
    print(torch.matmul(tensor_1, tensor_2).size())

    It will output the following content at the end:

    torch.Size([1, 1, 40, 40])
    Tip:

    If you encounter any problem, please refer to here for help.

  • To exit the Python interactive shell, simply press Ctrl+Z and then Enter (or type exit() and press Enter). An optional extra check of the XPU device is sketched below.
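
As an optional extra check, you can also confirm that the XPU device itself is visible to PyTorch. The snippet below is a minimal sketch; it assumes that the intel_extension_for_pytorch build pulled in by ipex-llm[xpu] registers the torch.xpu backend, and the reported device name will of course differ per machine.

# Minimal sketch: confirm the XPU backend is registered and show the device name.
# Assumption: intel_extension_for_pytorch (installed with ipex-llm[xpu]) provides torch.xpu.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device type

print(torch.xpu.is_available())      # expected: True
print(torch.xpu.get_device_name(0))  # e.g. the name of your iGPU or Arc GPU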

Monitor GPU Status

To monitor your GPU's performance and status (e.g. memory consumption, utilization, etc.), you can use either the Windows Task Manager (in the Performance tab; see the left side of the figure below) or the Arc Control application (see the right side of the figure below).
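
If you would rather check memory usage from inside a Python process than from a GUI, torch.xpu also exposes basic memory counters. The snippet below is a minimal sketch; it assumes the intel_extension_for_pytorch build installed with ipex-llm[xpu] provides torch.xpu.memory_allocated and torch.xpu.max_memory_allocated, and the numbers will vary with your workload.

# Minimal sketch: print current XPU memory usage from Python.
# Assumption: torch.xpu memory counters are available via intel_extension_for_pytorch.
import torch
import intel_extension_for_pytorch as ipex

x = torch.randn(1024, 1024).to('xpu')  # allocate something on the GPU
print(f"allocated now : {torch.xpu.memory_allocated() / 1024**2:.1f} MiB")
print(f"peak allocated: {torch.xpu.max_memory_allocated() / 1024**2:.1f} MiB")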

A Quick Example

Now let's play with a real LLM. We'll be using the Qwen-1.8B-Chat model, a 1.8-billion-parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".

  • Step 1: Follow the Runtime Configurations section above to prepare your runtime environment.
  • Step 2: Install the additional packages required by Qwen-1.8B-Chat:

    pip install tiktoken transformers_stream_generator einops
  • Step 3: Create the code file. IPEX-LLM supports loading models from Hugging Face or ModelScope. Please choose according to your requirements.

    • For loading the model from Hugging Face:

      Create a new file named demo.py and insert the code snippet below to run the Qwen-1.8B-Chat model with IPEX-LLM optimizations.

      # Copy/Paste the contents to a new file demo.py
      import torch
      from ipex_llm.transformers import AutoModelForCausalLM
      from transformers import AutoTokenizer, GenerationConfig
      generation_config = GenerationConfig(use_cache=True)
      
      print('Now start loading Tokenizer and optimizing Model...')
      tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
                                                trust_remote_code=True)
      
      # Load Model using ipex-llm and load it to GPU
      model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
                                                   load_in_4bit=True,
                                                   cpu_embedding=True,
                                                   trust_remote_code=True)
      model = model.to('xpu')
      print('Successfully loaded Tokenizer and optimized Model!')
      
      # Format the prompt
      question = "What is AI?"
      prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
      
      # Generate predicted tokens
      with torch.inference_mode():
         input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
      
         print('--------------------------------------Note-----------------------------------------')
         print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
         print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
         print('| Please be patient until it finishes warm-up...                                  |')
         print('-----------------------------------------------------------------------------------')
      
         # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
         # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
         output = model.generate(input_ids,
                                 do_sample=False,
                                 max_new_tokens=32,
                                 generation_config=generation_config) # warm-up
      
         print('Successfully finished warm-up, now start generation...')
      
         output = model.generate(input_ids,
                                 do_sample=False,
                                 max_new_tokens=32,
                                 generation_config=generation_config).cpu()
         output_str = tokenizer.decode(output[0], skip_special_tokens=True)
         print(output_str)
    • For loading the model from ModelScope:

      Please first run the following command in the Miniforge Prompt to install ModelScope:

      pip install modelscope==1.11.0

      Create a new file named demo.py and insert the code snippet below to run the Qwen-1.8B-Chat model with IPEX-LLM optimizations.

      # Copy/Paste the contents to a new file demo.py
      import torch
      from ipex_llm.transformers import AutoModelForCausalLM
      from transformers import GenerationConfig
      from modelscope import AutoTokenizer
      generation_config = GenerationConfig(use_cache=True)
      
      print('Now start loading Tokenizer and optimizing Model...')
      tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
                                                trust_remote_code=True)
      
      # Load Model using ipex-llm and load it to GPU
      model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
                                                   load_in_4bit=True,
                                                   cpu_embedding=True,
                                                   trust_remote_code=True,
                                                   model_hub='modelscope')
      model = model.to('xpu')
      print('Successfully loaded Tokenizer and optimized Model!')
      
      # Format the prompt
      question = "What is AI?"
      prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
      
      # Generate predicted tokens
      with torch.inference_mode():
         input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
      
         print('--------------------------------------Note-----------------------------------------')
         print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
         print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
         print('| Please be patient until it finishes warm-up...                                  |')
         print('-----------------------------------------------------------------------------------')
      
         # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
         # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
         output = model.generate(input_ids,
                                 do_sample=False,
                                 max_new_tokens=32,
                                 generation_config=generation_config) # warm-up
      
         print('Successfully finished warm-up, now start generation...')
      
         output = model.generate(input_ids,
                                 do_sample=False,
                                 max_new_tokens=32,
                                 generation_config=generation_config).cpu()
         output_str = tokenizer.decode(output[0], skip_special_tokens=True)
         print(output_str)
      Note:

      Please note that the repo id on ModelScope may be different from Hugging Face for some models.

Note

When running LLMs on Intel iGPUs with limited memory, we recommend setting cpu_embedding=True in the from_pretrained function. This allows the memory-intensive embedding layer to use the CPU instead of the GPU.

  • Step 4: Run demo.py within the activated Python environment using the following command:

    python demo.py

Example output

Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU:

user: What is AI?

assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,

Tips & Troubleshooting

Warm-up for optimal performance on first run

When running LLMs on GPU for the first time, you might notice that performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running model.generate(...) an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.
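
In code, the warm-up is nothing more than calling model.generate(...) once and discarding the result before the generation you actually care about, exactly as demo.py above already does. The fragment below is a minimal sketch of the pattern; it reuses the model, tokenizer and input_ids objects defined in demo.py and is not a standalone script.

# Minimal sketch of the warm-up pattern (continues demo.py above; not standalone).
import torch

with torch.inference_mode():
    _ = model.generate(input_ids, do_sample=False, max_new_tokens=32)  # warm-up: compiles and initializes GPU kernels
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32).cpu()  # the generation you actually keep
    print(tokenizer.decode(output[0], skip_special_tokens=True))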
