AIGC series posts:
[AIGC series] 1: Autoencoders (AutoEncoder, AE)
[AIGC series] 2: Introduction to DALL·E 2 (including an introduction to diffusion models)
[AIGC series] 3: How Stable Diffusion works
[AIGC series] 4: Stable Diffusion in practice, with code analysis
[AIGC series] 5: Data processing and pre-training pipelines of video generation models (Sora, MovieGen, HunyuanVideo)
[AIGC series] 6: Deploying the HunyuanVideo video generation model, with code analysis

Table of contents

1 Introduction
2 Deployment
2.1 Environment setup
2.1.1 Option 1: use the Open-R1 environment
2.1.2 Option 2: use the official Docker image
2.2 Download the pretrained models
2.2.1 Hunyuan diffusion model and VAE model
2.2.2 text-encoder-tokenizer
2.2.3 CLIP model
2.3 Video generation command
3 Source code analysis
3.1 Inference workflow
3.2 Model initialization
3.3 Model inference
3.4 Model architecture
3.4.1 Double-stream block: MMDoubleStreamBlock
3.4.2 Single-stream block: MMSingleStreamBlock
3.4.3 Hunyuan backbone: HYVideoDiffusionTransformer

1 Introduction
Let's look at the results first.

Generating a 540p video takes more than 40 GB of GPU memory during inference: 720 x 1280 resolution needs roughly 76 GB, and 544 x 960 roughly 43 GB. Running 50 sampling steps takes more than an hour and a half, which is quite long. Prompt: A cat walks on the grass, realistic style.

The generated video is shown below. The result is quite realistic, including the lighting and shadows, the detail of the cat's fur, the depth of field, and so on.

Video link: Bilibili

2 Deployment
2.1 Environment setup

2.1.1 Option 1: use the Open-R1 environment

HunyuanVideo also uses flash attention, so for convenience we reuse the Open-R1 environment to run it.

For the detailed Open-R1 setup steps, see the post "[Reproducing DeepSeek-R1 with Open R1] Part 1: Getting SFT running, step by step".

After cloning the HunyuanVideo source code, install its dependencies:

python -m pip install -r requirements.txt

2.1.2 Option 2: use the official Docker image

Alternatively, we can use the Docker image provided by the official repo:
# 1. Use the following link to download the docker image tar file (For CUDA 12).
wget https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/hunyuan_video_cu12.tar

# 2. Import the docker tar file and show the image meta information (For CUDA 12).
docker load -i hunyuan_video_cu12.tar
docker image ls

# 3. Run the container based on the image
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged docker_image_tag

Option 1 is recommended. I tried Option 2, hit a torch error, and did not dig into it further; some users on GitHub also report that the CUDA 12 docker image has issues. If you are interested, you can use the CUDA 11.8 image instead:
# For CUDA 12.4 (updated to avoid float point exception)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12

# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11

2.2 Download the pretrained models
We need to download the Hunyuan diffusion model, the VAE model, the text-encoder-tokenizer model, and the CLIP model.

2.2.1 Hunyuan diffusion model and VAE model

HuggingFace: https://huggingface.co/tencent/HunyuanVideo/tree/main. The FP8-quantized model is recommended; inference then takes a little over 40 GB of GPU memory in total. Put the diffusion model in ckpts/hunyuan-video-t2v-720p/transformers and the VAE model in ckpts/hunyuan-video-t2v-720p/vae.
2.2.2 text-encoder-tokenizer 
We can download the text-encoder-tokenizer model directly from https://huggingface.co/Kijai/llava-llama-3-8b-text-encoder-tokenizer and save it in the ckpts/text_encoder folder. The official workflow instead downloads the full LLaVA model and then extracts the text encoder (language_model) and the tokenizer, which adds an extra extraction step.

2.2.3 CLIP model

Download the CLIP-ViT-L model from https://huggingface.co/openai/clip-vit-large-patch14 and place it in the ckpts/text_encoder_2 folder.
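If you prefer to fetch everything from the command line, here is a sketch using huggingface-cli (an assumption about your tooling, not part of the original post; it requires pip install "huggingface_hub[cli]", and the target directories match the layout shown next):

# Hypothetical download commands; adjust paths to your setup.
huggingface-cli download tencent/HunyuanVideo --local-dir ckpts
huggingface-cli download Kijai/llava-llama-3-8b-text-encoder-tokenizer --local-dir ckpts/text_encoder
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ckpts/text_encoder_2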
The final layout of the pretrained model directory looks like this:
HunyuanVideo
├── ckpts
│   ├── README.md
│   ├── hunyuan-video-t2v-720p
│   │   ├── transformers
│   │   │   ├── mp_rank_00_model_states.pt
│   │   │   ├── mp_rank_00_model_states_fp8.pt
│   │   │   ├── mp_rank_00_model_states_fp8_map.pt
│   │   ├── vae
│   ├── text_encoder
│   ├── text_encoder_2
├── ...

2.3 Video generation command
Inference with the FP8 model:

cd HunyuanVideo

python3 sample_video.py \
    --dit-weight ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --seed 42 \
    --embedded-cfg-scale 6.0 \
    --flow-shift 7.0 \
    --flow-reverse \
    --use-cpu-offload \
    --use-fp8 \
    --save-path ./results

--dit-weight specifies the path of the FP8 checkpoint, and --use-fp8 tells the script that the weights are stored in FP8 format.
Argument reference:
| Argument | Default | Description |
|---|---|---|
| --prompt | None | The text prompt for video generation |
| --video-size | 720 1280 | The size of the generated video |
| --video-length | 129 | The length of the generated video |
| --infer-steps | 50 | The number of steps for sampling |
| --embedded-cfg-scale | 6.0 | Embedded classifier-free guidance scale |
| --flow-shift | 7.0 | Shift factor for flow matching schedulers |
| --flow-reverse | False | If reverse, learning/sampling from t=1 -> t=0 |
| --seed | None | The random seed for generating video; if None, we init a random seed |
| --use-cpu-offload | False | Use CPU offload for the model load to save more memory, necessary for high-res video generation |
| --save-path | ./results | Path to save the generated video |
We can also run multi-GPU inference:

cd HunyuanVideo

torchrun --nproc_per_node=8 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 8 \
    --ring-degree 1 \
    --save-path ./results

3 Source code analysis
3.1 Inference workflow

The entry point of the demo above is sample_video.py. It defines a main function that loads the model and generates video samples, i.e. the overall inference workflow.
def main():
    # Parse the command-line arguments and store them in args.
    args = parse_args()
    print(args)

    # args.model_base gives the base path of the models; raise if the path does not exist.
    models_root_path = Path(args.model_base)
    if not models_root_path.exists():
        raise ValueError(f"models_root not exists: {models_root_path}")

    # Create the save directory.
    save_path = args.save_path if args.save_path_suffix == "" else f"{args.save_path}_{args.save_path_suffix}"
    if not os.path.exists(args.save_path):
        os.makedirs(save_path, exist_ok=True)

    # Load the pretrained models from the given path, passing in args.
    hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)
    # Update args with the model's internal arguments to keep everything consistent afterwards.
    args = hunyuan_video_sampler.args

    # Generate the video samples.
    outputs = hunyuan_video_sampler.predict(
        prompt=args.prompt,                               # text prompt guiding the generated content
        height=args.video_size[0],                        # frame resolution (height and width)
        width=args.video_size[1],
        video_length=args.video_length,                   # number of frames
        seed=args.seed,                                   # random seed, for reproducibility
        negative_prompt=args.neg_prompt,                  # negative prompt: traits to avoid
        infer_steps=args.infer_steps,                     # number of inference steps
        guidance_scale=args.cfg_scale,                    # guidance scale, controls generation quality
        num_videos_per_prompt=args.num_videos,            # number of videos per prompt
        flow_shift=args.flow_shift,                       # flow control parameter between frames
        batch_size=args.batch_size,                       # batch size
        embedded_guidance_scale=args.embedded_cfg_scale,  # embedded guidance scale for specific features
    )
    samples = outputs["samples"]

    # Save the video samples.
    # Check the LOCAL_RANK environment variable (local process id in distributed runs);
    # only the main process (variable absent or equal to 0) saves the samples.
    if "LOCAL_RANK" not in os.environ or int(os.environ["LOCAL_RANK"]) == 0:
        # Iterate over the generated samples.
        for i, sample in enumerate(samples):
            sample = samples[i].unsqueeze(0)
            # Add a timestamp to the file name.
            time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%H:%M:%S")
            save_path = f"{save_path}/{time_flag}_seed{outputs['seeds'][i]}_{outputs['prompts'][i][:100].replace('/','')}.mp4"
            save_videos_grid(sample, save_path, fps=24)   # save the video
            logger.info(f"Sample save to: {save_path}")   # log the output path

3.2 Model initialization
When loading the model, HunyuanVideoSampler.from_pretrained() is called to build an instance. This method is implemented in the parent class Inference (hyvideo/inference.py): it initializes the core components (VAE, text encoders, diffusion model, etc.) and then constructs and returns an instance via cls. Because HunyuanVideoSampler inherits from_pretrained() from its parent class, cls here resolves to HunyuanVideoSampler, so the returned object is a HunyuanVideoSampler instance.
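Before looking at the actual implementation, here is a minimal toy sketch (not HunyuanVideo code) of the cls behaviour described above: a classmethod defined on the parent class builds and returns an instance of whichever subclass it is called on.

class Inference:
    @classmethod
    def from_pretrained(cls, path):
        # cls is the class the method was called on, not necessarily Inference.
        return cls(path)

class HunyuanVideoSampler(Inference):
    def __init__(self, path):
        self.path = path

sampler = HunyuanVideoSampler.from_pretrained("ckpts")
print(type(sampler).__name__)  # HunyuanVideoSampler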
class Inference(object):...classmethoddef from_pretrained(cls, pretrained_model_path, args, deviceNone, **kwargs):...in_channels  args.latent_channels      # 16out_channels  args.latent_channels     # 16model  load_model(     # HYVideoDiffusionTransformerargs,in_channelsin_channels,out_channelsout_channels,factor_kwargsfactor_kwargs,)...# VAEvae, _, s_ratio, t_ratio  load_vae(     # AutoencoderKLCausal3Dargs.vae,args.vae_precision,loggerlogger,devicedevice if not args.use_cpu_offload else cpu,)# Text encoderif args.prompt_template_video is not None:crop_start  PROMPT_TEMPLATE[args.prompt_template_video].get(   # 95crop_start, 0)...max_length  args.text_len  crop_start     # 25695# prompt_templateprompt_template  (PROMPT_TEMPLATE[args.prompt_template]if args.prompt_template is not Noneelse None)# prompt_template_videoprompt_template_video  (PROMPT_TEMPLATE[args.prompt_template_video]if args.prompt_template_video is not Noneelse None)text_encoder  TextEncoder(     # text-encoder-tokenizertext_encoder_typeargs.text_encoder,        # llavamax_lengthmax_length,text_encoder_precisionargs.text_encoder_precision,tokenizer_typeargs.tokenizer,prompt_templateprompt_template,prompt_template_videoprompt_template_video,hidden_state_skip_layerargs.hidden_state_skip_layer,apply_final_normargs.apply_final_norm,reproduceargs.reproduce,loggerlogger,devicedevice if not args.use_cpu_offload else cpu,)text_encoder_2  Noneif args.text_encoder_2 is not None:text_encoder_2  TextEncoder(text_encoder_typeargs.text_encoder_2,      # clipLmax_lengthargs.text_len_2,text_encoder_precisionargs.text_encoder_precision_2,tokenizer_typeargs.tokenizer_2,reproduceargs.reproduce,loggerlogger,devicedevice if not args.use_cpu_offload else cpu,)return cls(     # 初始化本类的一个实例argsargs,vaevae,        # AutoencoderKLCausal3Dvae_kwargsvae_kwargs,text_encodertext_encoder,      # llmtext_encoder_2text_encoder_2,modelmodel,use_cpu_offloadargs.use_cpu_offload,devicedevice,loggerlogger,)最后使用了cls()会调用HunyuanVideoSampler的初始化方法__init__()指定所有组件并将他们组合到一个pipeline包括模型、调度器scheduler、设备配置等必要的组件然后指定负面提示词。 
class HunyuanVideoSampler(Inference):def __init__(...):super().__init__(...)self.pipeline  self.load_diffusion_pipeline(       # 组合所有原件argsargs,vaeself.vae,text_encoderself.text_encoder,text_encoder_2self.text_encoder_2,modelself.model,deviceself.device,)self.default_negative_prompt  NEGATIVE_PROMPT      # 负面提示词def load_diffusion_pipeline(self,args,vae,text_encoder,text_encoder_2,model,schedulerNone,deviceNone,progress_bar_configNone,data_typevideo,):Load the denoising scheduler for inference.# 去噪调度器的初始化if scheduler is None:if args.denoise_type  flow:# 流动匹配的去噪策略离散去噪调度器可能用于视频生成任务中时间帧之间的一致性建模。# 负责指导扩散模型逐步还原噪声生成清晰的视频帧。scheduler  FlowMatchDiscreteScheduler(shiftargs.flow_shift, # 流动偏移值。reverseargs.flow_reverse, # 是否反向计算。solverargs.flow_solver, # 去噪求解器的类型)else:raise ValueError(fInvalid denoise type {args.denoise_type})# 构建推理pipelinepipeline  HunyuanVideoPipeline(vaevae, # 负责特征编码和解码的模块。text_encodertext_encoder, # 用于处理文本提示生成与视频生成相关的特征。text_encoder_2text_encoder_2,transformermodel, # 主扩散模型生成视频的核心模块。schedulerscheduler, # 去噪调度器控制扩散生成的时间步长和过程progress_bar_configprogress_bar_config, # 可选的进度条配置用于显示推理进度。argsargs, # 配置参数的集合)# 配置计算资源if self.use_cpu_offload:# 将部分计算任务卸载到 CPU。这是显存不足时的优化策略可以大幅降低 GPU 的显存占用。pipeline.enable_sequential_cpu_offload()else:# 如果为 False直接将管道加载到指定的 device如 GPU上运行pipeline  pipeline.to(device)return pipeline 
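The FlowMatchDiscreteScheduler created here is what turns flow_shift into an actual timestep schedule. As a rough illustration, below is a hedged sketch of the SD3-style time shift that such flow-matching schedulers typically apply (an assumption about the exact formula; check hyvideo/diffusion/schedulers for the real implementation). Larger shift values spend more of the step budget at high noise levels, which matters for high-resolution video.

import numpy as np

def shifted_sigmas(num_steps: int, shift: float = 7.0) -> np.ndarray:
    # Uniform sigmas from 1 (pure noise) to 0 (clean), then SD3-style shift.
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

print(shifted_sigmas(10, shift=7.0))  # sigmas are pushed towards 1, i.e. more high-noise steps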
The prompt templates and the negative prompt are defined as follows:
PROMPT_TEMPLATE_ENCODE = (
    "<|start_header_id|>system<|end_header_id|>\n\nDescribe the image by detailing the color, shape, size, texture, "
    "quantity, text, spatial relationships of the objects and background:<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
)

PROMPT_TEMPLATE_ENCODE_VIDEO = (
    "<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: "
    "1. The main content and theme of the video."
    "2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects."
    "3. Actions, events, behaviors temporal relationships, physical movement changes of the objects."
    "4. background environment, light, style and atmosphere."
    "5. camera angles, movements, and transitions used in the video:<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
)

NEGATIVE_PROMPT = "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion"

PROMPT_TEMPLATE = {
    "dit-llm-encode": {
        "template": PROMPT_TEMPLATE_ENCODE,
        "crop_start": 36,
    },
    "dit-llm-encode-video": {
        "template": PROMPT_TEMPLATE_ENCODE_VIDEO,
        "crop_start": 95,
    },
}

3.3 Model inference
位置编码使用的是RoPE主要根据输入的视频维度、网络配置以及位置嵌入参数生成对应的正弦和余弦频率嵌入。 def get_rotary_pos_embed(self, video_length, height, width):# video_length: 视频的帧长度。# height, width: 视频的帧高和帧宽。 目标是根据这些维度计算位置嵌入。# 表示生成的 RoPE 的目标维度3D: 时间维度  空间高度和宽度target_ndim  3# 推导潜在特征latent feature所需维度的辅助变量ndim  5 - 2# 根据 self.args.vae 中的配置例如 VAE 模型类型 884 或 888确定潜在特征的空间尺寸 latents_size# 884: 时间维度下采样 4 倍1/4空间高宽下采样 8 倍1/8。if 884 in self.args.vae:latents_size  [(video_length - 1) // 4  1, height // 8, width // 8]# 888: 时间维度下采样 8 倍1/8空间高宽下采样 8 倍。elif 888 in self.args.vae:latents_size  [(video_length - 1) // 8  1, height // 8, width // 8]# 默认情况下不对时间维度下采样但高宽依然下采样 8 倍。else:latents_size  [video_length, height // 8, width // 8]# 检查潜在空间尺寸是否与 Patch 尺寸兼容# 如果 self.model.patch_size 是单个整数检查潜在特征维度的每一维是否能被 patch_size 整除。if isinstance(self.model.patch_size, int):assert all(s % self.model.patch_size  0 for s in latents_size), (fLatent size(last {ndim} dimensions) should be divisible by patch size({self.model.patch_size}), fbut got {latents_size}.)# 如果整除计算 RoPE 的输入尺寸 rope_sizes将 latents_size 每一维除以 patch_sizerope_sizes  [s // self.model.patch_size for s in latents_size]# 如果 self.model.patch_size 是一个列表分别对每一维进行整除检查和计算。elif isinstance(self.model.patch_size, list):assert all(s % self.model.patch_size[idx]  0for idx, s in enumerate(latents_size)), (fLatent size(last {ndim} dimensions) should be divisible by patch size({self.model.patch_size}), fbut got {latents_size}.)rope_sizes  [s // self.model.patch_size[idx] for idx, s in enumerate(latents_size)]# 如果 rope_sizes 的维度数不足 target_ndim在开头补充时间维度值为 1。if len(rope_sizes) ! target_ndim:rope_sizes  [1] * (target_ndim - len(rope_sizes))  rope_sizes  # time axis# head_dim 是单个注意力头的维度大小由模型的 hidden_size 和 heads_num 计算得出。head_dim  self.model.hidden_size // self.model.heads_num# rope_dim_list 是用于位置嵌入的维度分配列表# 如果未定义默认将 head_dim 平均分配到 target_ndim时间、高度、宽度。rope_dim_list  self.model.rope_dim_listif rope_dim_list is None:rope_dim_list  [head_dim // target_ndim for _ in range(target_ndim)]assert (sum(rope_dim_list)  head_dim), sum(rope_dim_list) should equal to head_dim of attention layer# 调用 get_nd_rotary_pos_embed 函数计算基于目标尺寸 rope_sizes 和维度分配 rope_dim_list 的多维旋转位置嵌入。freqs_cos, freqs_sin  get_nd_rotary_pos_embed(rope_dim_list,rope_sizes,thetaself.args.rope_theta, #控制位置嵌入频率。use_realTrue, # 表示使用真实数值而非复数形式。theta_rescale_factor1, # 无缩放因子。)#返回 freqs_cos: 余弦频率嵌入。freqs_sin: 正弦频率嵌入。return freqs_cos, freqs_sinpredict 函数用于从文本生成视频或图像的预测函数通过输入文本 prompt结合其他参数如视频分辨率、帧数、推理步数等生成指定数量的视频或图像。 torch.no_grad()def predict(self,prompt,height192,width336,video_length129,seedNone,negative_promptNone,infer_steps50,guidance_scale6,flow_shift5.0,embedded_guidance_scaleNone,batch_size1,num_videos_per_prompt1,**kwargs,):Predict the image/video from the given text.Args:prompt (str or List[str]): The input text.kwargs:height (int): The height of the output video. Default is 192.width (int): The width of the output video. Default is 336.video_length (int): The frame number of the output video. Default is 129.seed (int or List[str]): The random seed for the generation. Default is a random integer.negative_prompt (str or List[str]): The negative text prompt. Default is an empty string.guidance_scale (float): The guidance scale for the generation. Default is 6.0.num_images_per_prompt (int): The number of images per prompt. Default is 1.infer_steps (int): The number of inference steps. 
Default is 100.# 分布式环境检查if self.parallel_args[ulysses_degree]  1 or self.parallel_args[ring_degree]  1:assert seed is not None, \You have to set a seed in the distributed environment, please rerun with --seed your-seed.# 满足分布式环境的条件调用 parallelize_transformer 函数并行化模型parallelize_transformer(self.pipeline)# 初始化一个空字典 out_dict用于存储最终的生成结果。out_dict  dict()# # 根据传入的 seed 参数生成一组随机种子并将这些种子用于初始化随机数生成器 (torch.Generator) 来控制生成过程的随机性。# # 根据 seed 参数的类型None、int、list、tuple 或 torch.Tensor执行不同的逻辑生成用于控制随机数生成器的 seeds 列表if isinstance(seed, torch.Tensor):seed  seed.tolist()if seed is None:seeds  [random.randint(0, 1_000_000)for _ in range(batch_size * num_videos_per_prompt)]elif isinstance(seed, int):seeds  [seed  ifor _ in range(batch_size)for i in range(num_videos_per_prompt)]elif isinstance(seed, (list, tuple)):if len(seed)  batch_size:seeds  [int(seed[i])  jfor i in range(batch_size)for j in range(num_videos_per_prompt)]elif len(seed)  batch_size * num_videos_per_prompt:seeds  [int(s) for s in seed]else:raise ValueError(fLength of seed must be equal to number of prompt(batch_size) or fbatch_size * num_videos_per_prompt ({batch_size} * {num_videos_per_prompt}), got {seed}.)else:raise ValueError(fSeed must be an integer, a list of integers, or None, got {seed}.)# 对每个种子在指定设备self.device上创建一个 PyTorch 的随机数生成器 torch.Generator并使用对应的种子进行手动初始化# manual_seed(seed)。将这些生成器存储在列表 generator 中。generator  [torch.Generator(self.device).manual_seed(seed) for seed in seeds]# 将生成的 seeds 列表存储在 out_dict 中供后续使用可能用于复现生成结果或记录生成过程的随机性。out_dict[seeds]  seeds# # 检查和调整视频生成的输入参数height、width 和 video_length的合法性与对齐要求并计算出调整后的目标尺寸。# # 检查输入的 height、width 和 video_length 是否为正整数if width  0 or height  0 or video_length  0:raise ValueError(fheight and width and video_length must be positive integers, got height{height}, width{width}, video_length{video_length})# 检查 video_length - 1 是否为 4 的倍数if (video_length - 1) % 4 ! 
0:raise ValueError(fvideo_length-1 must be a multiple of 4, got {video_length})# 日志记录logger.info(fInput (height, width, video_length)  ({height}, {width}, {video_length}))# 目标高度和宽度对齐到 16 的倍数target_height  align_to(height, 16)target_width  align_to(width, 16)target_video_length  video_length# 存储目标尺寸out_dict[size]  (target_height, target_width, target_video_length)# # 检查和处理文本生成任务中的 prompt 和 negative_prompt 参数# # 确保输入的 prompt 是字符串类型if not isinstance(prompt, str):raise TypeError(fprompt must be a string, but got {type(prompt)})prompt  [prompt.strip()] # 对 prompt 去除首尾多余的空格使用 .strip()然后包装成一个单元素列表# 处理 negative_prompt 参数if negative_prompt is None or negative_prompt  :negative_prompt  self.default_negative_promptif not isinstance(negative_prompt, str):raise TypeError(fnegative_prompt must be a string, but got {type(negative_prompt)})negative_prompt  [negative_prompt.strip()]# # 设置调度器 (Scheduler)# scheduler  FlowMatchDiscreteScheduler( # 处理流Flow的调度shiftflow_shift, # 控制流动调度器的偏移量。flow_shift 通常与时序或流动模型相关例如调整时间步之间的关系。reverseself.args.flow_reverse, # 决定是否反向调度可能是在推理过程中逆序生成帧solverself.args.flow_solver # 指定用于调度的解算器类型solver例如选择数值方法来优化时间步间的计算。)self.pipeline.scheduler  scheduler# # 构建旋转位置嵌入 (Rotary Positional Embedding)# # 根据目标视频长度、高度和宽度生成正弦 (freqs_sin) 和余弦 (freqs_cos) 的频率嵌入。freqs_cos, freqs_sin  self.get_rotary_pos_embed(target_video_length, target_height, target_width)# 表示视频中总的编码标记数tokens通常等于时间步数帧数与空间分辨率像素数相乘。n_tokens  freqs_cos.shape[0]# # 打印推理参数# debug_str  fheight: {target_height}width: {target_width}video_length: {target_video_length}prompt: {prompt}neg_prompt: {negative_prompt}seed: {seed}infer_steps: {infer_steps}num_videos_per_prompt: {num_videos_per_prompt}guidance_scale: {guidance_scale}n_tokens: {n_tokens}flow_shift: {flow_shift}embedded_guidance_scale: {embedded_guidance_scale}logger.debug(debug_str)# # Pipeline inference# start_time  time.time()samples  self.pipeline(promptprompt, # 文本提示用于指导生成内容。heighttarget_height, # 生成图像或视频帧的分辨率。widthtarget_width, #video_lengthtarget_video_length, # 视频的帧数。如果 video_length  1表示生成视频否则生成单张图像。num_inference_stepsinfer_steps, # 推理步数决定生成过程的细粒度程度步数越多生成结果越精细。guidance_scaleguidance_scale, # 指导比例控制生成与 prompt 的一致性程度。negative_promptnegative_prompt, # 负面提示用于约束生成内容避免不期望的结果。num_videos_per_promptnum_videos_per_prompt, # 每条提示生成的视频数量。generatorgenerator, # 随机生成器对象用于控制生成过程中的随机性通常与随机种子结合。output_typepil, # 指定输出格式为 PIL.Image 对象便于后续处理freqs_cis(freqs_cos, freqs_sin), # 旋转位置嵌入 (RoPE) 的频率矩阵增强时空位置感知能力。n_tokensn_tokens, # 输入序列的总标记数用于指导生成过程。embedded_guidance_scaleembedded_guidance_scale, # 嵌入式指导比例用于进一步优化嵌入向量的生成。data_typevideo if target_video_length  1 else image, # 指定生成目标为视频或图像取决于帧数。is_progress_barTrue, # 显示推理进度条方便监控生成进度。vae_verself.args.vae, # 使用指定版本的 VAE变分自编码器决定生成内容的潜在空间。enable_tilingself.args.vae_tiling, # 启用 VAE 分块处理提高内存效率特别适用于高分辨率生成。)[0] # 返回生成的样本通常是一个 PIL.Image 或视频帧序列# 保存生成结果out_dict[samples]  samplesout_dict[prompts]  prompt# 计算并记录推理时间gen_time  time.time() - start_timelogger.info(fSuccess, time: {gen_time})return out_dict 
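To make the latent-size arithmetic in get_rotary_pos_embed concrete, here is a small worked sketch, assuming the "884" VAE (temporal 1/4, spatial 1/8) and the default patch size [1, 2, 2]; the (129, 192, 336) case matches the latent shape [2, 16, 33, 24, 42] noted in the pipeline comments below.

def latent_and_token_sizes(video_length, height, width, patch_size=(1, 2, 2)):
    # 884 VAE: temporal downsampling by 4 (plus the first frame), spatial downsampling by 8.
    lat_t = (video_length - 1) // 4 + 1
    lat_h, lat_w = height // 8, width // 8
    # Number of transformer tokens after patchify, which is also the RoPE sequence length.
    tokens = (lat_t // patch_size[0]) * (lat_h // patch_size[1]) * (lat_w // patch_size[2])
    return (lat_t, lat_h, lat_w), tokens

print(latent_and_token_sizes(129, 192, 336))   # ((33, 24, 42), 8316)
print(latent_and_token_sizes(129, 720, 1280))  # ((33, 90, 160), 118800)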
推理pipelinehyvideo/diffusion/pipelines/pipeline_hunyuan_video.pycall 方法接受用户输入的提示、生成图像或视频的尺寸以及其他生成过程的参数完成推理并返回生成的图像或视频。 torch.no_grad()replace_example_docstring(EXAMPLE_DOC_STRING)def __call__(self,prompt: Union[str, List[str]],height: int,width: int,video_length: int,data_type: str  video,num_inference_steps: int  50,timesteps: List[int]  None,sigmas: List[float]  None,guidance_scale: float  7.5,negative_prompt: Optional[Union[str, List[str]]]  None,num_videos_per_prompt: Optional[int]  1,eta: float  0.0,generator: Optional[Union[torch.Generator, List[torch.Generator]]]  None,latents: Optional[torch.Tensor]  None,prompt_embeds: Optional[torch.Tensor]  None,attention_mask: Optional[torch.Tensor]  None,negative_prompt_embeds: Optional[torch.Tensor]  None,negative_attention_mask: Optional[torch.Tensor]  None,output_type: Optional[str]  pil,return_dict: bool  True,cross_attention_kwargs: Optional[Dict[str, Any]]  None,guidance_rescale: float  0.0,clip_skip: Optional[int]  None,callback_on_step_end: Optional[Union[Callable[[int, int, Dict], None],PipelineCallback,MultiPipelineCallbacks,]]  None,callback_on_step_end_tensor_inputs: List[str]  [latents],freqs_cis: Tuple[torch.Tensor, torch.Tensor]  None,vae_ver: str  88-4c-sd,enable_tiling: bool  False,n_tokens: Optional[int]  None,embedded_guidance_scale: Optional[float]  None,**kwargs,):rThe call function to the pipeline for generation.Args:prompt (str or List[str]):The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.height (int):The height in pixels of the generated image.width (int):The width in pixels of the generated image.video_length (int):The number of frames in the generated video.num_inference_steps (int, *optional*, defaults to 50):The number of denoising steps. More denoising steps usually lead to a higher quality image at theexpense of slower inference.timesteps (List[int], *optional*):Custom timesteps to use for the denoising process with schedulers which support a timesteps argumentin their set_timesteps method. If not defined, the default behavior when num_inference_steps ispassed will be used. Must be in descending order.sigmas (List[float], *optional*):Custom sigmas to use for the denoising process with schedulers which support a sigmas argument intheir set_timesteps method. If not defined, the default behavior when num_inference_steps is passedwill be used.guidance_scale (float, *optional*, defaults to 7.5):A higher guidance scale value encourages the model to generate images closely linked to the textprompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale  1.negative_prompt (str or List[str], *optional*):The prompt or prompts to guide what to not include in image generation. If not defined, you need topass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale  1).num_videos_per_prompt (int, *optional*, defaults to 1):The number of images to generate per prompt.eta (float, *optional*, defaults to 0.0):Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only appliesto the [~schedulers.DDIMScheduler], and is ignored in other schedulers.generator (torch.Generator or List[torch.Generator], *optional*):A [torch.Generator](https://pytorch.org/docs/stable/generated/torch.Generator.html) to makegeneration deterministic.latents (torch.Tensor, *optional*):Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for imagegeneration. 
Can be used to tweak the same generation with different prompts. If not provided, a latentstensor is generated by sampling using the supplied random generator.prompt_embeds (torch.Tensor, *optional*):Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If notprovided, text embeddings are generated from the prompt input argument.negative_prompt_embeds (torch.Tensor, *optional*):Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). Ifnot provided, negative_prompt_embeds are generated from the negative_prompt input argument.output_type (str, *optional*, defaults to pil):The output format of the generated image. Choose between PIL.Image or np.array.return_dict (bool, *optional*, defaults to True):Whether or not to return a [HunyuanVideoPipelineOutput] instead of aplain tuple.cross_attention_kwargs (dict, *optional*):A kwargs dictionary that if specified is passed along to the [AttentionProcessor] as defined in[self.processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).guidance_rescale (float, *optional*, defaults to 0.0):Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps areFlawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure whenusing zero terminal SNR.clip_skip (int, *optional*):Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means thatthe output of the pre-final layer will be used for computing the prompt embeddings.callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, *optional*):A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end ofeach denoising step during the inference. with the following arguments: callback_on_step_end(self:DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include alist of all tensors as specified by callback_on_step_end_tensor_inputs.callback_on_step_end_tensor_inputs (List, *optional*):The list of tensor inputs for the callback_on_step_end function. The tensors specified in the listwill be passed as callback_kwargs argument. You will only be able to include variables listed in the._callback_tensor_inputs attribute of your pipeline class.Examples:Returns:[~HunyuanVideoPipelineOutput] or tuple:If return_dict is True, [HunyuanVideoPipelineOutput] is returned,otherwise a tuple is returned where the first element is a list with the generated images and thesecond element is a list of bools indicating whether the corresponding generated image containsnot-safe-for-work (nsfw) content.# 处理与回调函数相关的参数同时对已弃用的参数发出警告deprecation warnings。# 它还检查了新的回调函数机制 callback_on_step_end 是否符合预期类型。callback  kwargs.pop(callback, None)callback_steps  kwargs.pop(callback_steps, None)if callback is not None:deprecate(callback,1.0.0,Passing callback as an input argument to __call__ is deprecated, consider using callback_on_step_end,)if callback_steps is not None:deprecate(callback_steps,1.0.0,Passing callback_steps as an input argument to __call__ is deprecated, consider using callback_on_step_end,)if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):callback_on_step_end_tensor_inputs  callback_on_step_end.tensor_inputs# 0. 
Default height and width to unet# height  height or self.transformer.config.sample_size * self.vae_scale_factor# width  width or self.transformer.config.sample_size * self.vae_scale_factor# to deal with lora scaling and other possible forward hooks# 1. 验证输入参数是否合法。self.check_inputs(prompt,height,width,video_length,callback_steps, # 回调频率指定在生成过程中每隔多少步执行一次回调。negative_prompt,prompt_embeds, # 预嵌入的提示词和反向提示词。如果已经对文本进行了嵌入处理可以直接传递这些值而不是原始文本。negative_prompt_embeds,callback_on_step_end_tensor_inputs, # 与回调机制相关的数据张量。vae_vervae_ver, # 可选参数可能指定生成内容时使用的 VAE变分自动编码器的版本。)# 控制生成内容的引导强度。一般用于调整模型对 prompt提示词的依赖程度。较大的值会让生成内容更接近提示词但可能导致丢失多样性。self._guidance_scale  guidance_scale# 用于重新调整指导比例可能是对 guidance_scale 的一种动态调整。用于平衡模型在特定生成任务中的表现。self._guidance_rescale  guidance_rescale# 控制是否在 CLIP 模型中跳过某些层的计算。在某些生成任务中跳过部分层可以改善生成质量。self._clip_skip  clip_skip# 与交叉注意力Cross Attention相关的参数。可能包括对注意力权重的控制比如调整注意力机制如何在提示词和生成内容之间分配权重。self._cross_attention_kwargs  cross_attention_kwargs# 标志可能在生成过程的某些阶段被动态修改。# 如果 _interrupt 被设置为 True生成过程可能会被中止。这种设计通常用于在用户希望终止长时间生成任务时使用。self._interrupt  False# 2. 根据输入的提示词 prompt 或嵌入 prompt_embeds确定生成任务的批量大小batch_size。if prompt is not None and isinstance(prompt, str):batch_size  1 # 如果 prompt 是单个字符串说明只有一个提示词。批量大小设置为 1。elif prompt is not None and isinstance(prompt, list):# 如果 prompt 是一个列表说明有多个提示词。# 此时批量大小等于提示词的数量即 len(prompt)。batch_size  len(prompt)else:#如果 prompt 是 None说明提示词未提供可能直接使用预先计算的嵌入 prompt_embeds。# 此时批量大小由 prompt_embeds 的第一维通常是样本数量决定。batch_size  prompt_embeds.shape[0]# 确定设备的devicedevice  torch.device(fcuda:{dist.get_rank()}) if dist.is_initialized() else self._execution_device# 3. Encode input prompt# 处理 LoRALow-Rank Adaptation缩放系数通过 cross_attention_kwargs 提取或设置缩放比例 lora_scalelora_scale  (self.cross_attention_kwargs.get(scale, None)if self.cross_attention_kwargs is not Noneelse None)# 对提示词进行编码将文本提示词 prompt 和负向提示词 negative_prompt 编码为嵌入向量并生成对应的注意力掩码。(prompt_embeds, # 正向提示词的嵌入向量。negative_prompt_embeds, # 负向提示词的嵌入向量。prompt_mask, # 正向提示词的注意力掩码。negative_prompt_mask, # 负向提示词的注意力掩码。)  self.encode_prompt(prompt,device,num_videos_per_prompt,self.do_classifier_free_guidance,negative_prompt,prompt_embedsprompt_embeds,attention_maskattention_mask,negative_prompt_embedsnegative_prompt_embeds,negative_attention_masknegative_attention_mask,lora_scalelora_scale,clip_skipself.clip_skip,data_typedata_type,)# 处理多文本编码器若存在额外的文本编码器 text_encoder_2使用该编码器再次处理提示词。if self.text_encoder_2 is not None:(prompt_embeds_2,negative_prompt_embeds_2,prompt_mask_2,negative_prompt_mask_2,)  self.encode_prompt(prompt,device,num_videos_per_prompt,self.do_classifier_free_guidance,negative_prompt,prompt_embedsNone,attention_maskNone,negative_prompt_embedsNone,negative_attention_maskNone,lora_scalelora_scale,clip_skipself.clip_skip,text_encoderself.text_encoder_2,data_typedata_type,)else:prompt_embeds_2  Nonenegative_prompt_embeds_2  Noneprompt_mask_2  Nonenegative_prompt_mask_2  None# 处理自由分类指导Classifier-Free Guidance为实现该技术合并正向和负向提示词嵌入避免多次前向传递。if self.do_classifier_free_guidance:# 功能如果启用了自由分类指导则将正向和负向提示词的嵌入和掩码合并为一个批次。# 原因自由分类指导需要两次前向传递一次处理负向提示词指导无条件生成一次处理正向提示词指导条件生成。# 为了提高效率将两组嵌入拼接在一起作为一个批次传递给模型避免两次单独的前向传递。prompt_embeds  torch.cat([negative_prompt_embeds, prompt_embeds])if prompt_mask is not None:prompt_mask  torch.cat([negative_prompt_mask, prompt_mask])if prompt_embeds_2 is not None:prompt_embeds_2  torch.cat([negative_prompt_embeds_2, prompt_embeds_2])if prompt_mask_2 is not None:prompt_mask_2  torch.cat([negative_prompt_mask_2, prompt_mask_2])# 4. 
Prepare timesteps#  准备调度器的额外参数extra_set_timesteps_kwargs  self.prepare_extra_func_kwargs(self.scheduler.set_timesteps, {n_tokens: n_tokens})# 获取推理过程中需要用到的时间步 (timesteps) 和推理步数 (num_inference_steps)。timesteps, num_inference_steps  retrieve_timesteps(self.scheduler,num_inference_steps,device,timesteps,sigmas,**extra_set_timesteps_kwargs,)# 根据 vae_ver 调整视频长度if 884 in vae_ver:video_length  (video_length - 1) // 4  1elif 888 in vae_ver:video_length  (video_length - 1) // 8  1else:video_length  video_length# 5. Prepare latent variablesnum_channels_latents  self.transformer.config.in_channelslatents  self.prepare_latents(batch_size * num_videos_per_prompt,num_channels_latents,height,width,video_length,prompt_embeds.dtype,device,generator,latents,)# 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipelineextra_step_kwargs  self.prepare_extra_func_kwargs(self.scheduler.step, # 扩散模型的调度器中的 step 方法负责更新噪声预测结果。{generator: generator, eta: eta}, # 一个字典包含生成器 generator 和步长相关参数 eta。)#  确定目标数据类型及自动混合精度的设置target_dtype  PRECISION_TO_TYPE[self.args.precision]autocast_enabled  (target_dtype ! torch.float32) and not self.args.disable_autocast# 确定 VAE 的数据类型及自动混合精度设置vae_dtype  PRECISION_TO_TYPE[self.args.vae_precision]vae_autocast_enabled  (vae_dtype ! torch.float32) and not self.args.disable_autocast# 7. 初始化去噪循环的预处理步骤# timesteps调度器生成的时间步序列。# num_inference_steps推理过程中真正的去噪步数。# self.scheduler.order调度器的阶数通常与预测算法的高阶插值相关。num_warmup_steps  len(timesteps) - num_inference_steps * self.scheduler.orderself._num_timesteps  len(timesteps)# if is_progress_bar:# progress_bar 用于显示推理过程的进度num_inference_steps 是总推理步数。with self.progress_bar(totalnum_inference_steps) as progress_bar:for i, t in enumerate(timesteps):if self.interrupt:continue# 如果启用了 分类器自由指导do_classifier_free_guidance则将 latents 复制两份用于同时计算 条件预测 和 无条件预测。# 否则仅使用原始 latents。latent_model_input  (torch.cat([latents] * 2)if self.do_classifier_free_guidanceelse latents)# 调用 scheduler 的 scale_model_input 方法对 latent_model_input 在当前时间步 t 上进行预处理。# 这个缩放操作可能根据调度器的实现涉及到归一化或其他调整。latent_model_input  self.scheduler.scale_model_input(latent_model_input, t)# t_expand 将时间步 t 扩展到与 latent_model_input 的批量维度一致。# 如果 embedded_guidance_scale 存在则创建扩展的指导参数 guidance_expand用于对模型预测进行额外控制。t_expand  t.repeat(latent_model_input.shape[0])guidance_expand  (torch.tensor([embedded_guidance_scale] * latent_model_input.shape[0],dtypetorch.float32,devicedevice,).to(target_dtype)* 1000.0if embedded_guidance_scale is not Noneelse None)# 使用 Transformer 模型预测噪声残差with torch.autocast(device_typecuda, dtypetarget_dtype, enabledautocast_enabled):noise_pred  self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)latent_model_input,  # 当前的潜变量输入 [2, 16, 33, 24, 42]t_expand,  # 时间步信息 [2]text_statesprompt_embeds,  # 与文本提示相关的嵌入向量 [2, 256, 4096]text_maskprompt_mask,  # [2, 256]text_states_2prompt_embeds_2,  # [2, 768]freqs_cosfreqs_cis[0],  # 频率信息用于特定的时间步缩放 [seqlen, head_dim]freqs_sinfreqs_cis[1],  # [seqlen, head_dim]guidanceguidance_expand, # 指导参数用于条件生成return_dictTrue,)[x]# 分类器自由指导的噪声调整if self.do_classifier_free_guidance:noise_pred_uncond, noise_pred_text  noise_pred.chunk(2) # 无条件预测的噪声条件预测的噪声基于文本提示noise_pred  noise_pred_uncond  self.guidance_scale * (noise_pred_text - noise_pred_uncond)# 噪声重缩放if self.do_classifier_free_guidance and self.guidance_rescale  0.0:# Based on 3.4. 
in https://arxiv.org/pdf/2305.08891.pdfnoise_pred  rescale_noise_cfg(noise_pred,noise_pred_text,guidance_rescaleself.guidance_rescale,)# 使用调度器更新潜变量# compute the previous noisy sample x_t - x_t-1latents  self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dictFalse)[0]# callback_on_step_end 函数则在每步结束时调用用于自定义操作如日志记录、结果保存。# 更新潜变量和提示嵌入向量。if callback_on_step_end is not None:callback_kwargs  {}for k in callback_on_step_end_tensor_inputs:callback_kwargs[k]  locals()[k]callback_outputs  callback_on_step_end(self, i, t, callback_kwargs)latents  callback_outputs.pop(latents, latents)prompt_embeds  callback_outputs.pop(prompt_embeds, prompt_embeds)negative_prompt_embeds  callback_outputs.pop(negative_prompt_embeds, negative_prompt_embeds)# 进度条更新与其他回调if i  len(timesteps) - 1 or ((i  1)  num_warmup_steps and (i  1) % self.scheduler.order  0):if progress_bar is not None:progress_bar.update()if callback is not None and i % callback_steps  0:step_idx  i // getattr(self.scheduler, order, 1)callback(step_idx, t, latents)# 从潜变量latent space解码生成图像if not output_type  latent:#  潜变量维度的扩展检查expand_temporal_dim  False# 如果形状为 4D ([batch_size, channels, height, width])# 如果 VAE 是 3D 自回归模型AutoencoderKLCausal3D则对潜变量增加一个时间维度 (unsqueeze(2))。# 设置 expand_temporal_dimTrue标记后续需要移除该额外维度。if len(latents.shape)  4:if isinstance(self.vae, AutoencoderKLCausal3D):latents  latents.unsqueeze(2)expand_temporal_dim  True# 如果形状为 5D ([batch_size, channels, frames, height, width])则不需要操作。elif len(latents.shape)  5:passelse:raise ValueError(fOnly support latents with shape (b, c, h, w) or (b, c, f, h, w), but got {latents.shape}.)# 潜变量的缩放与偏移# 检查 VAE 配置中是否定义了 shift_factor偏移因子if (hasattr(self.vae.config, shift_factor)and self.vae.config.shift_factor): # 如果存在则对潜变量执行缩放和偏移操作latents  (latents / self.vae.config.scaling_factor self.vae.config.shift_factor)else: # 如果 shift_factor 不存在仅进行缩放操作latents  latents / self.vae.config.scaling_factorwith torch.autocast(device_typecuda, dtypevae_dtype, enabledvae_autocast_enabled):if enable_tiling:# 调用 VAE 的 enable_tiling() 方法可能用于解码较大的图像块。self.vae.enable_tiling()# 使用 VAE变分自编码器的 decode 方法将潜变量解码为图像。image  self.vae.decode(latents, return_dictFalse, generatorgenerator)[0]else:image  self.vae.decode(latents, return_dictFalse, generatorgenerator)[0]# 如果添加了时间维度expand_temporal_dimTrue或者解码出的图像在时间维度上只有一个帧则移除时间维度。if expand_temporal_dim or image.shape[2]  1:image  image.squeeze(2)else:image  latents# 图像归一化image  (image / 2  0.5).clamp(0, 1)# 将图像移动到 CPU并转换为 float32 类型。这是为了确保图像兼容性无论之前是否使用了混合精度image  image.cpu().float()# 调用 maybe_free_model_hooks() 方法可能会释放模型占用的内存资源尤其是在内存有限的 GPU 上有用。self.maybe_free_model_hooks()# 如果不需要返回字典return_dictFalse则直接返回处理后的图像if not return_dict:return imagereturn HunyuanVideoPipelineOutput(videosimage) 
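The heart of the denoising loop above is the classifier-free-guidance combination plus the optional noise rescaling. Here is a hedged sketch of that step in isolation (the rescaling follows "Common Diffusion Noise Schedules and Sample Steps are Flawed", arXiv:2305.08891; the exact rescale_noise_cfg used by the pipeline may differ slightly):

import torch

def cfg_combine(noise_uncond, noise_text, guidance_scale=6.0, guidance_rescale=0.0):
    # Standard classifier-free guidance: move from the unconditional prediction
    # towards the text-conditional one.
    noise = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    if guidance_rescale > 0.0:
        # Rescale the combined prediction so its std matches the conditional one,
        # then blend, to avoid over-exposure at high guidance scales.
        dims = tuple(range(1, noise_text.ndim))
        std_text = noise_text.std(dim=dims, keepdim=True)
        std_cfg = noise.std(dim=dims, keepdim=True)
        noise = guidance_rescale * (noise * std_text / std_cfg) + (1.0 - guidance_rescale) * noise
    return noise

u, t = torch.randn(2, 16, 33, 24, 42), torch.randn(2, 16, 33, 24, 42)
print(cfg_combine(u, t, guidance_scale=6.0, guidance_rescale=0.7).shape)  # torch.Size([2, 16, 33, 24, 42])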
3.4 Model architecture

The model architecture is defined in hyvideo/modules/models.py and mainly consists of the double-stream block, the single-stream block, and the backbone network.

3.4.1 Double-stream block: MMDoubleStreamBlock
class MMDoubleStreamBlock(nn.Module):A multimodal dit block with seperate modulation fortext and image/video, see more details (SD3): https://arxiv.org/abs/2403.03206(Flux.1): https://github.com/black-forest-labs/fluxdef __init__(self,hidden_size: int, # 模型隐藏层维度。heads_num: int, # 多头注意力的头数。mlp_width_ratio: float, # MLP 中隐藏层宽度与 hidden_size 的比率。mlp_act_type: str  gelu_tanh, # 激活函数的类型默认 gelu_tanhqk_norm: bool  True, # 是否对 Query 和 Key 启用归一化。qk_norm_type: str  rms, # Query 和 Key 归一化的方法默认 rms。qkv_bias: bool  False, # QKV 投影中是否启用偏置项。dtype: Optional[torch.dtype]  None, # 张量的数据类型和设备。device: Optional[torch.device]  None,):factory_kwargs  {device: device, dtype: dtype}super().__init__()self.deterministic  Falseself.heads_num  heads_numhead_dim  hidden_size // heads_nummlp_hidden_dim  int(hidden_size * mlp_width_ratio)### 图像模态# 模态调制模块使用 ModulateDiT为图像和文本生成 6 组参数shift、scale、gate。self.img_mod  ModulateDiT(hidden_size,factor6,act_layerget_activation_layer(silu),**factory_kwargs,)# 归一化self.img_norm1  nn.LayerNorm(hidden_size, elementwise_affineFalse, eps1e-6, **factory_kwargs)# QKV 投影层通过全连接层计算 Query、Key 和 Valueself.img_attn_qkv  nn.Linear(hidden_size, hidden_size * 3, biasqkv_bias, **factory_kwargs)# 归一化模块qk_norm_layer  get_norm_layer(qk_norm_type)self.img_attn_q_norm  (qk_norm_layer(head_dim, elementwise_affineTrue, eps1e-6, **factory_kwargs)if qk_normelse nn.Identity())self.img_attn_k_norm  (qk_norm_layer(head_dim, elementwise_affineTrue, eps1e-6, **factory_kwargs)if qk_normelse nn.Identity())self.img_attn_proj  nn.Linear(hidden_size, hidden_size, biasqkv_bias, **factory_kwargs)self.img_norm2  nn.LayerNorm(hidden_size, elementwise_affineFalse, eps1e-6, **factory_kwargs)self.img_mlp  MLP(hidden_size,mlp_hidden_dim,act_layerget_activation_layer(mlp_act_type),biasTrue,**factory_kwargs,)### 文本模态self.txt_mod  ModulateDiT(hidden_size,factor6,act_layerget_activation_layer(silu),**factory_kwargs,)self.txt_norm1  nn.LayerNorm(hidden_size, elementwise_affineFalse, eps1e-6, **factory_kwargs)self.txt_attn_qkv  nn.Linear(hidden_size, hidden_size * 3, biasqkv_bias, **factory_kwargs)self.txt_attn_q_norm  (qk_norm_layer(head_dim, elementwise_affineTrue, eps1e-6, **factory_kwargs)if qk_normelse nn.Identity())self.txt_attn_k_norm  (qk_norm_layer(head_dim, elementwise_affineTrue, eps1e-6, **factory_kwargs)if qk_normelse nn.Identity())self.txt_attn_proj  nn.Linear(hidden_size, hidden_size, biasqkv_bias, **factory_kwargs)self.txt_norm2  nn.LayerNorm(hidden_size, elementwise_affineFalse, eps1e-6, **factory_kwargs)self.txt_mlp  MLP(hidden_size,mlp_hidden_dim,act_layerget_activation_layer(mlp_act_type),biasTrue,**factory_kwargs,)self.hybrid_seq_parallel_attn  Nonedef enable_deterministic(self):self.deterministic  Truedef disable_deterministic(self):self.deterministic  Falsedef forward(self,img: torch.Tensor, # 图像张量 (B, L_img, hidden_size)txt: torch.Tensor, # 文本张量 (B, L_txt, hidden_size)vec: torch.Tensor, # 特征向量用于调制cu_seqlens_q: Optional[torch.Tensor]  None,  # Query 的累积序列长度cu_seqlens_kv: Optional[torch.Tensor]  None, # Key/Value 的累积序列长度max_seqlen_q: Optional[int]  None,   # Query 最大序列长度max_seqlen_kv: Optional[int]  None,  # Key/Value 最大序列长度freqs_cis: tuple  None, # 可选的旋转位置编码参数) - Tuple[torch.Tensor, torch.Tensor]:# vec 特征向量通过 ModulateDiT 模块分别为图像和文本模态生成 6 组调制参数(img_mod1_shift,img_mod1_scale,img_mod1_gate,img_mod2_shift,img_mod2_scale,img_mod2_gate,)  self.img_mod(vec).chunk(6, dim-1)(txt_mod1_shift,txt_mod1_scale,txt_mod1_gate,txt_mod2_shift,txt_mod2_scale,txt_mod2_gate,)  self.txt_mod(vec).chunk(6, dim-1)图像模态的前向处理# Layernorm 
归一化img_modulated  self.img_norm1(img)# 调制函数 modulate 进行标准化和缩放img_modulated  modulate(img_modulated, shiftimg_mod1_shift, scaleimg_mod1_scale)# 得到 Query、Key 和 Valueimg_qkv  self.img_attn_qkv(img_modulated)img_q, img_k, img_v  rearrange(img_qkv, B L (K H D) - K B L H D, K3, Hself.heads_num)# 对 Query 和 Key 进行归一化。img_q  self.img_attn_q_norm(img_q).to(img_v)img_k  self.img_attn_k_norm(img_k).to(img_v)# 对 Query 和 Key 应用旋转位置编码。if freqs_cis is not None:img_qq, img_kk  apply_rotary_emb(img_q, img_k, freqs_cis, head_firstFalse)assert (img_qq.shape  img_q.shape and img_kk.shape  img_k.shape), fimg_kk: {img_qq.shape}, img_q: {img_q.shape}, img_kk: {img_kk.shape}, img_k: {img_k.shape}img_q, img_k  img_qq, img_kk文本模态的前向处理txt_modulated  self.txt_norm1(txt)txt_modulated  modulate(txt_modulated, shifttxt_mod1_shift, scaletxt_mod1_scale)txt_qkv  self.txt_attn_qkv(txt_modulated)txt_q, txt_k, txt_v  rearrange(txt_qkv, B L (K H D) - K B L H D, K3, Hself.heads_num)# Apply QK-Norm if needed.txt_q  self.txt_attn_q_norm(txt_q).to(txt_v)txt_k  self.txt_attn_k_norm(txt_k).to(txt_v)# 将图像和文本的 Query、Key、Value 拼接q  torch.cat((img_q, txt_q), dim1)k  torch.cat((img_k, txt_k), dim1)v  torch.cat((img_v, txt_v), dim1)assert (cu_seqlens_q.shape[0]  2 * img.shape[0]  1), fcu_seqlens_q.shape:{cu_seqlens_q.shape}, img.shape[0]:{img.shape[0]}# 多模态融合注意力计算if not self.hybrid_seq_parallel_attn:attn  attention(q,k,v,cu_seqlens_qcu_seqlens_q,cu_seqlens_kvcu_seqlens_kv,max_seqlen_qmax_seqlen_q,max_seqlen_kvmax_seqlen_kv,batch_sizeimg_k.shape[0],)else:attn  parallel_attention(self.hybrid_seq_parallel_attn,q,k,v,img_q_lenimg_q.shape[1],img_kv_lenimg_k.shape[1],cu_seqlens_qcu_seqlens_q,cu_seqlens_kvcu_seqlens_kv)# 最终将注意力结果拆分为图像部分 img_attn 和文本部分 txt_attnimg_attn, txt_attn  attn[:, : img.shape[1]], attn[:, img.shape[1] :]图像模态的更新# 将注意力结果通过残差连接更新图像特征并通过 MLP 进一步增强img  img  apply_gate(self.img_attn_proj(img_attn), gateimg_mod1_gate)img  img  apply_gate(self.img_mlp(modulate(self.img_norm2(img), shiftimg_mod2_shift, scaleimg_mod2_scale)),gateimg_mod2_gate,)文本模态的更新txt  txt  apply_gate(self.txt_attn_proj(txt_attn), gatetxt_mod1_gate)txt  txt  apply_gate(self.txt_mlp(modulate(self.txt_norm2(txt), shifttxt_mod2_shift, scaletxt_mod2_scale)),gatetxt_mod2_gate,)# 返回更新后的图像特征和文本特征return img, txt 
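Both streams above rely on the modulate and apply_gate helpers to inject the conditioning vector vec. Here is a hedged sketch of what they do, assuming they follow the standard DiT adaLN-Zero form (see hyvideo/modules for the exact implementation):

import torch

def modulate(x, shift, scale):
    # Per-sample shift/scale vectors are broadcast over the sequence dimension.
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

def apply_gate(x, gate):
    # Gate the residual branch, again broadcast over the sequence dimension.
    return x * gate.unsqueeze(1)

x = torch.randn(2, 10, 3072)                      # (batch, seq_len, hidden_size)
shift, scale, gate = torch.randn(3, 2, 3072).unbind(0)
print(apply_gate(modulate(x, shift, scale), gate).shape)  # torch.Size([2, 10, 3072])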
3.4.2 Single-stream block: MMSingleStreamBlock
class MMSingleStreamBlock(nn.Module):A DiT block with parallel linear layers as described inhttps://arxiv.org/abs/2302.05442 and adapted modulation interface.Also refer to (SD3): https://arxiv.org/abs/2403.03206(Flux.1): https://github.com/black-forest-labs/fluxdef __init__(self,hidden_size: int, # 隐藏层的维度大小用于表示特征的维度。heads_num: int, # 注意力头的数量。mlp_width_ratio: float  4.0, # 用于确定多层感知机 (MLP) 的隐藏层宽度比例默认值为 4.0mlp_act_type: str  gelu_tanh, # 激活函数类型qk_norm: bool  True,  # 决定是否对 Query 和 Key 应用归一化qk_norm_type: str  rms, # 指定 Query 和 Key 的归一化方式例如 rms均方根归一化qk_scale: float  None, # 自定义缩放因子用于注意力分数计算中的缩放dtype: Optional[torch.dtype]  None, # 控制数据类型device: Optional[torch.device]  None, # 控制缩放因子):factory_kwargs  {device: device, dtype: dtype}super().__init__()self.deterministic  Falseself.hidden_size  hidden_sizeself.heads_num  heads_numhead_dim  hidden_size // heads_nummlp_hidden_dim  int(hidden_size * mlp_width_ratio)self.mlp_hidden_dim  mlp_hidden_dimself.scale  qk_scale or head_dim ** -0.5# qkv and mlp_inself.linear1  nn.Linear(hidden_size, hidden_size * 3  mlp_hidden_dim, **factory_kwargs)# proj and mlp_outself.linear2  nn.Linear(hidden_size  mlp_hidden_dim, hidden_size, **factory_kwargs)qk_norm_layer  get_norm_layer(qk_norm_type)self.q_norm  (qk_norm_layer(head_dim, elementwise_affineTrue, eps1e-6, **factory_kwargs)if qk_normelse nn.Identity())self.k_norm  (qk_norm_layer(head_dim, elementwise_affineTrue, eps1e-6, **factory_kwargs)if qk_normelse nn.Identity())self.pre_norm  nn.LayerNorm(hidden_size, elementwise_affineFalse, eps1e-6, **factory_kwargs)self.mlp_act  get_activation_layer(mlp_act_type)()self.modulation  ModulateDiT(hidden_size,factor3,act_layerget_activation_layer(silu),**factory_kwargs,)self.hybrid_seq_parallel_attn  Nonedef enable_deterministic(self):self.deterministic  Truedef disable_deterministic(self):self.deterministic  Falsedef forward(self,x: torch.Tensor, # x: 输入特征张量形状为 (batch_size, seq_len, hidden_size)vec: torch.Tensor, # vec: 辅助特征向量通常来自调制器txt_len: int, # txt_len: 文本序列长度用于区分图像和文本部分。cu_seqlens_q: Optional[torch.Tensor]  None,  # 累积序列长度用于高效的分段注意力计算。cu_seqlens_kv: Optional[torch.Tensor]  None,max_seqlen_q: Optional[int]  None,# Query 和 Key/Value 的最大序列长度。max_seqlen_kv: Optional[int]  None,freqs_cis: Tuple[torch.Tensor, torch.Tensor]  None, # 可选的旋转位置编码RoPE) - torch.Tensor:# 调用 modulation 获取调制参数 mod_shift、mod_scale 和 mod_gate。mod_shift, mod_scale, mod_gate  self.modulation(vec).chunk(3, dim-1)# 对输入 x 应用 LayerNorm并进行调制即元素级缩放和偏移x_mod  modulate(self.pre_norm(x), shiftmod_shift, scalemod_scale)# 将 x_mod 映射到 qkv 和 mlp 两个部分。qkv, mlp  torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim-1)# qkv 被分为 Query (q)、Key (k)、Value (v) 三个张量形状为 (batch_size, seq_len, heads_num, head_dim)。q, k, v  rearrange(qkv, B L (K H D) - K B L H D, K3, Hself.heads_num)# 对 Query 和 Key 应用归一化。q  self.q_norm(q).to(v)k  self.k_norm(k).to(v)# 旋转位置编码 (RoPE)if freqs_cis is not None:img_q, txt_q  q[:, :-txt_len, :, :], q[:, -txt_len:, :, :]img_k, txt_k  k[:, :-txt_len, :, :], k[:, -txt_len:, :, :]# 分别对图像和文本部分应用旋转位置编码img_qq, img_kk  apply_rotary_emb(img_q, img_k, freqs_cis, head_firstFalse)assert (img_qq.shape  img_q.shape and img_kk.shape  img_k.shape), fimg_kk: {img_qq.shape}, img_q: {img_q.shape}, img_kk: {img_kk.shape}, img_k: {img_k.shape}img_q, img_k  img_qq, img_kk# 图像部分和文本部分的 Query/Key 在编码后重新拼接。q  torch.cat((img_q, txt_q), dim1)k  torch.cat((img_k, txt_k), dim1)# Compute attention.assert (cu_seqlens_q.shape[0]  2 * x.shape[0]  1), fcu_seqlens_q.shape:{cu_seqlens_q.shape}, 
x.shape[0]:{x.shape[0]}# attention computation startif not self.hybrid_seq_parallel_attn:# 如果没有启用并行注意力机制调用标准注意力函数 attentionattn  attention(q,k,v,cu_seqlens_qcu_seqlens_q,cu_seqlens_kvcu_seqlens_kv,max_seqlen_qmax_seqlen_q,max_seqlen_kvmax_seqlen_kv,batch_sizex.shape[0],)else:# 否则使用并行注意力机制 parallel_attentionattn  parallel_attention(self.hybrid_seq_parallel_attn,q,k,v,img_q_lenimg_q.shape[1],img_kv_lenimg_k.shape[1],cu_seqlens_qcu_seqlens_q,cu_seqlens_kvcu_seqlens_kv)# attention computation end# 将注意力结果和 MLP 激活结果拼接通过线性层投影回输入维度。output  self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))# 使用 mod_gate 进行门控融合将残差连接后的结果返回。return x  apply_gate(output, gatemod_gate) 
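One design point worth noting in the single-stream block is that the QKV projection and the MLP input share a single fused linear layer (linear1), and the attention output and the activated MLP are projected back together by linear2. With the model defaults from the next subsection (hidden_size=3072, heads_num=24, mlp_width_ratio=4.0), the dimensions work out as in this small sketch:

hidden_size, heads_num, mlp_width_ratio = 3072, 24, 4.0
head_dim = hidden_size // heads_num                   # 128 per attention head
mlp_hidden_dim = int(hidden_size * mlp_width_ratio)   # 12288
linear1_out = 3 * hidden_size + mlp_hidden_dim        # QKV + MLP input: 21504
linear2_in = hidden_size + mlp_hidden_dim             # attention output + activated MLP: 15360
print(head_dim, linear1_out, linear2_in)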
3.4.3 Hunyuan backbone: HYVideoDiffusionTransformer
class HYVideoDiffusionTransformer(ModelMixin, ConfigMixin):HunyuanVideo Transformer backbone该类继承了 ModelMixin 和 ConfigMixin使其与 diffusers 库的采样器例如 StableDiffusionPipeline兼容ModelMixin: 来自 diffusers 的模块提供了模型的保存和加载功能。ConfigMixin: 使模型能够以字典形式保存和加载配置信息。Reference:[1] Flux.1: https://github.com/black-forest-labs/flux[2] MMDiT: http://arxiv.org/abs/2403.03206Parameters----------args: argparse.Namespace传入的命令行参数用于设置模型的配置。patch_size: list输入特征的分块尺寸。一般用于图像或视频的分块操作。in_channels: int输入数据的通道数如 RGB 图像为 3 通道。out_channels: int模型输出的通道数。hidden_size: intTransformer 模块中隐藏层的维度。heads_num: int多头注意力机制中的注意力头数量通常用来分配不同的注意力特征。mlp_width_ratio: floatMLP多层感知机中隐藏层维度相对于 hidden_size 的比例。mlp_act_type: strMLP 使用的激活函数类型例如 ReLU、GELU 等。depth_double_blocks: int双 Transformer 块的数量。双块可能是指包含多层结构的单元。depth_single_blocks: int单 Transformer 块的数量。rope_dim_list: list为时空维度t, h, w设计的旋转位置编码ROPE的维度。qkv_bias: bool是否在 QKV查询、键、值线性层中使用偏置项。qk_norm: bool是否对 Q 和 K 应用归一化。qk_norm_type: strQK 归一化的类型。guidance_embed: bool是否使用指导嵌入guidance embedding来支持蒸馏训练。text_projection: str文本投影类型默认为 single_refiner可能用于文本引导的视频生成。use_attention_mask: bool是否在文本编码器中使用注意力掩码。dtype: torch.dtype模型参数的数据类型例如 torch.float32 或 torch.float16。device: torch.device模型的运行设备如 CPU 或 GPU。register_to_configdef __init__(self,args: Any, #patch_size: list  [1, 2, 2],in_channels: int  4,  # Should be VAE.config.latent_channels.out_channels: int  None,hidden_size: int  3072,heads_num: int  24,mlp_width_ratio: float  4.0,mlp_act_type: str  gelu_tanh,mm_double_blocks_depth: int  20,mm_single_blocks_depth: int  40,rope_dim_list: List[int]  [16, 56, 56],qkv_bias: bool  True,qk_norm: bool  True,qk_norm_type: str  rms,guidance_embed: bool  False,  # For modulation.text_projection: str  single_refiner,use_attention_mask: bool  True,dtype: Optional[torch.dtype]  None,device: Optional[torch.device]  None,):# 用来传递设备和数据类型如torch.float32的参数方便后续模块的初始化。factory_kwargs  {device: device, dtype: dtype}super().__init__()self.patch_size  patch_sizeself.in_channels  in_channelsself.out_channels  in_channels if out_channels is None else out_channelsself.unpatchify_channels  self.out_channels # 用来重新拼接patch时的通道数。self.guidance_embed  guidance_embedself.rope_dim_list  rope_dim_list# Text projection. Default to linear projection.# Alternative: TokenRefiner. See more details (LI-DiT): http://arxiv.org/abs/2406.11831self.use_attention_mask  use_attention_maskself.text_projection  text_projectionself.text_states_dim  args.text_states_dimself.text_states_dim_2  args.text_states_dim_2# 确保每个头的维度是整数。if hidden_size % heads_num ! 0:raise ValueError(fHidden size {hidden_size} must be divisible by heads_num {heads_num})pe_dim  hidden_size // heads_num# 确保位置嵌入的维度与Transformer头的维度一致。if sum(rope_dim_list) ! 
pe_dim:raise ValueError(fGot {rope_dim_list} but expected positional dim {pe_dim})self.hidden_size  hidden_sizeself.heads_num  heads_num# 将输入图像分割为小块patch并映射到Transformer的隐藏空间hidden_size。# 每个patch相当于一个Transformer的输入token。self.img_in  PatchEmbed(self.patch_size, self.in_channels, self.hidden_size, **factory_kwargs)# 根据text_projection参数选择不同的文本投影方式# TextProjection线性投影直接将文本特征映射到模型隐藏空间。if self.text_projection  linear:self.txt_in  TextProjection(self.text_states_dim,self.hidden_size,get_activation_layer(silu),**factory_kwargs,)# SingleTokenRefiner使用小型Transformer深度为2对文本特征进行细化处理。elif self.text_projection  single_refiner:self.txt_in  SingleTokenRefiner(self.text_states_dim, hidden_size, heads_num, depth2, **factory_kwargs)else:raise NotImplementedError(fUnsupported text_projection: {self.text_projection})# TimestepEmbedder时间步嵌入模块输入时间信息例如视频帧的索引并嵌入到Transformer隐藏空间。self.time_in  TimestepEmbedder(self.hidden_size, get_activation_layer(silu), **factory_kwargs)# text modulation# MLPEmbedder用于处理来自文本或其他辅助信息的特征并投影到隐藏空间。self.vector_in  MLPEmbedder(self.text_states_dim_2, self.hidden_size, **factory_kwargs)# guidance_in引导嵌入模块用于处理额外的控制信号如扩散模型中的引导提示。self.guidance_in  (TimestepEmbedder(self.hidden_size, get_activation_layer(silu), **factory_kwargs)if guidance_embedelse None)# MMDoubleStreamBlock多模态双流块融合了图像流和文本流信息。self.double_blocks  nn.ModuleList([MMDoubleStreamBlock(self.hidden_size,self.heads_num,mlp_width_ratiomlp_width_ratio,mlp_act_typemlp_act_type,qk_normqk_norm,qk_norm_typeqk_norm_type,qkv_biasqkv_bias,**factory_kwargs,)for _ in range(mm_double_blocks_depth)])# MMSingleStreamBlock单流块用于进一步处理多模态融合后的单一流特征。self.single_blocks  nn.ModuleList([MMSingleStreamBlock(self.hidden_size,self.heads_num,mlp_width_ratiomlp_width_ratio,mlp_act_typemlp_act_type,qk_normqk_norm,qk_norm_typeqk_norm_type,**factory_kwargs,)for _ in range(mm_single_blocks_depth)])# FinalLayer将Transformer隐藏空间中的token重新解码为图像patch并还原到完整图像的分辨率。self.final_layer  FinalLayer(self.hidden_size,self.patch_size,self.out_channels,get_activation_layer(silu),**factory_kwargs,)# 分别在模型中的 双流模块double_blocks 和 单流模块single_blocks 中启用或禁用确定性行为。def enable_deterministic(self):# 在深度学习中启用确定性行为意味着模型在同样的输入和参数初始化条件下无论多少次运行都能产生相同的输出结果。for block in self.double_blocks:block.enable_deterministic()for block in self.single_blocks:block.enable_deterministic()def disable_deterministic(self):# 禁用确定性行为可能会允许使用非确定性的操作如某些高效的并行实现从而提升计算效率。for block in self.double_blocks:block.disable_deterministic()for block in self.single_blocks:block.disable_deterministic()def forward(self,x: torch.Tensor, # 输入图像张量形状为 (N, C, T, H, W)。批量大小为 N通道数为 C时间步为 T高度和宽度为 H 和 W。t: torch.Tensor,  # 时间步张量用于时间嵌入。范围应为 [0, 1000]可能对应扩散模型或时间相关的特征。text_states: torch.Tensor  None, # 文本嵌入表示与图像配对的文本特征。text_mask: torch.Tensor  None,  # 文本掩码张量可选。当前未使用可能用于控制哪些文本特征参与计算。text_states_2: Optional[torch.Tensor]  None,  # 额外的文本嵌入用于进一步调制modulation。在模型中可能是辅助的文本特征表示freqs_cos: Optional[torch.Tensor]  None, # 正弦和余弦频率用于位置编码或调制。freqs_sin: Optional[torch.Tensor]  None,guidance: torch.Tensor  None,  # 引导调制强度形状可能是 cfg_scale x 1000。通常用于引导生成如扩散模型的分类引导。return_dict: bool  True, # 是否返回一个字典结果。默认为 True。) - Union[torch.Tensor, Dict[str, torch.Tensor]]:out  {}img  xtxt  text_states_, _, ot, oh, ow  x.shape# 得到划分patch后的t,h,wtt, th, tw  (ot // self.patch_size[0],oh // self.patch_size[1],ow // self.patch_size[2],)# Prepare modulation vectors.# 时间嵌入通过 self.time_in(t) 提取特征。vec  self.time_in(t)# text modulation# 如果有额外文本嵌入 text_states_2则通过 self.vector_in 模块对 vec 进行调制。vec  vec  self.vector_in(text_states_2)# 启用了引导调制self.guidance_embed通过 self.guidance_in 引入引导特征。if 
self.guidance_embed:if guidance is None:raise ValueError(Didnt get guidance strength for guidance distilled model.)# our timestep_embedding is merged into guidance_in(TimestepEmbedder)vec  vec  self.guidance_in(guidance)# Embed image and text.# 图像嵌入img  self.img_in(img)# 文本嵌入if self.text_projection  linear: # 线性投影txt  self.txt_in(txt)elif self.text_projection  single_refiner: # 结合时间步 t 和文本掩码进行更复杂的处理。txt  self.txt_in(txt, t, text_mask if self.use_attention_mask else None)else:raise NotImplementedError(fUnsupported text_projection: {self.text_projection})txt_seq_len  txt.shape[1]img_seq_len  img.shape[1]# 计算序列长度和累积序列索引# 用于 Flash Attention 的高效计算cu_seqlens_* 和 max_seqlen_* 控制序列长度和最大长度。# Compute cu_squlens and max_seqlen for flash attentioncu_seqlens_q  get_cu_seqlens(text_mask, img_seq_len)cu_seqlens_kv  cu_seqlens_qmax_seqlen_q  img_seq_len  txt_seq_lenmax_seqlen_kv  max_seqlen_qfreqs_cis  (freqs_cos, freqs_sin) if freqs_cos is not None else None# --------------------- Pass through DiT blocks ------------------------for _, block in enumerate(self.double_blocks):double_block_args  [img,txt,vec,cu_seqlens_q,cu_seqlens_kv,max_seqlen_q,max_seqlen_kv,freqs_cis,]# 并行处理图像和文本信息使用输入参数包括嵌入和序列长度等逐步更新 img 和 txt。img, txt  block(*double_block_args)# 合并图像和文本并通过单流模块x  torch.cat((img, txt), 1)if len(self.single_blocks)  0:for _, block in enumerate(self.single_blocks):single_block_args  [x,vec,txt_seq_len,cu_seqlens_q,cu_seqlens_kv,max_seqlen_q,max_seqlen_kv,(freqs_cos, freqs_sin),]x  block(*single_block_args)# 分离图像特征img  x[:, :img_seq_len, ...]# ---------------------------- Final layer ------------------------------# 图像特征通过 final_layer 提取最终结果img  self.final_layer(img, vec)  # (N, T, patch_size ** 2 * out_channels)# 通过 unpatchify 恢复到原始分辨率。img  self.unpatchify(img, tt, th, tw)if return_dict:out[x]  imgreturn outreturn imgdef unpatchify(self, x, t, h, w):# 是将被切分为小块patches的特征重新还原成原始的张量形状通常用于图像处理任务中# 例如在 ViTVision Transformer模型的输出阶段将 patch 还原为完整图像的形式。x: (N, T, patch_size**2 * C)  批量大小时间帧数每个patch中的通道数imgs: (N, H, W, C)c  self.unpatchify_channelspt, ph, pw  self.patch_sizeassert t * h * w  x.shape[1]x  x.reshape(shape(x.shape[0], t, h, w, c, pt, ph, pw))x  torch.einsum(nthwcopq-nctohpwq, x)imgs  x.reshape(shape(x.shape[0], c, t * pt, h * ph, w * pw))return imgsdef params_count(self):# 计算模型的参数数量并将其按类别进行统计。它返回一个包含不同类别参数数量的字典通常用于分析模型的规模或复杂度。counts  {double: sum( # double_blocks 模块的所有参数数量[sum(p.numel() for p in block.img_attn_qkv.parameters()) sum(p.numel() for p in block.img_attn_proj.parameters()) sum(p.numel() for p in block.img_mlp.parameters()) sum(p.numel() for p in block.txt_attn_qkv.parameters()) sum(p.numel() for p in block.txt_attn_proj.parameters()) sum(p.numel() for p in block.txt_mlp.parameters())for block in self.double_blocks]),single: sum( # single_blocks 模块的所有参数数量[sum(p.numel() for p in block.linear1.parameters()) sum(p.numel() for p in block.linear2.parameters())for block in self.single_blocks]),total: sum(p.numel() for p in self.parameters()),}# double 和 single 参数的总和主要聚焦于注意力和 MLP 层。counts[attnmlp]  counts[double]  counts[single]return counts
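As a quick consistency check on the defaults above (a small sketch, reusing the constraint that the backbone itself asserts): the per-head dimension must equal the total RoPE dimension allocated to the (t, h, w) axes.

hidden_size, heads_num = 3072, 24
rope_dim_list = [16, 56, 56]      # RoPE dims for time, height, width
head_dim = hidden_size // heads_num
assert sum(rope_dim_list) == head_dim, (sum(rope_dim_list), head_dim)
print(head_dim)  # 128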