ControlNet的理解1——引言和相关工作
文章目录
一、前言
二、摘要
1. 引言
2. 相关工作
2.1. 微调神经网络

一、前言

参考资料：
ControlNet原论文：Adding Conditional Control to Text-to-Image Diffusion Models
论文地址：https://arxiv.org/pdf/2302.05543

ControlNet的作者团队由斯坦福大学的博士生张吕敏（Lvmin Zhang）领衔，另两位合作者分别是现香港科技大学的助理教授饶安逸（Anyi Rao）和斯坦福大学的Maneesh Agrawala教授。

核心作者：张吕敏（Lvmin Zhang），个人代号昵称是“lllyasviel”，因其卓越贡献被网友尊称为AI界的“赛博佛祖”。
教育背景：出生于中国，2021年在苏州大学获得工学学士学位，2022年起进入斯坦福大学攻读计算机科学博士学位，师从Maneesh Agrawala教授。
技术起点：大一就发表了AI绘画相关论文，本科期间已在ICCV/CVPR/ECCV等顶级会议发表了10篇论文。
代表作：除了ControlNet，还开发了Style2Paints、Fooocus、IC-Light、LayerDiffusion等知名项目。

其他作者介绍：
饶安逸（Anyi Rao）：团队的重要成员，同时也是ControlNet论文的合著者。他现任香港科技大学（HKUST）助理教授，领导“多媒体创意实验室”。
Maneesh Agrawala：斯坦福大学Forest Baskett讲席教授，也是张吕敏在斯坦福的博士生导师。他是麦克阿瑟天才奖和ACM Fellow得主，在人机交互与计算机图形学领域享有极高声誉。

二、摘要

We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls.

我们提出了ControlNet，一种将空间条件控制添加到大型预训练文本到图像扩散模型中的神经网络架构。ControlNet锁定了生产就绪的大型扩散模型，并重用了它们通过数十亿张图像预训练的深度而强大的编码层，作为学习多样化条件控制集的强大骨干。

The neural architecture is connected with “zero convolutions” (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts.

该神经网络架构通过“零卷积”（零初始化的卷积层）进行连接，这些卷积层的参数从零开始逐步增长，并确保不会有有害噪声影响微调。我们使用 Stable Diffusion 测试了各种条件控制，例如边缘、深度、分割、人体姿态等，支持单个或多个条件，有或无提示词。

We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.

我们证明了ControlNets的训练对于小型（<50k）和大型（>1m）数据集都是鲁棒的。广泛的结果表明，ControlNet可能有助于将图像扩散模型推广到更广泛的应用中。

1. 引言

Many of us have experienced flashes of visual inspiration that we wish to capture in a unique image. With the advent of text-to-image diffusion models [54, 62, 72], we can now create visually stunning images by typing in a text prompt.

我们中的许多人都曾经历过瞬间的视觉灵感，希望能将其捕捉成一幅独特的图像。随着文本到图像扩散模型 [54, 62, 72] 的出现，现在我们只需输入文本提示，就能创造出视觉上令人惊叹的图像。

Yet, text-to-image models are limited in the control they provide over the spatial composition of the image; precisely expressing complex layouts, poses, shapes and forms can be difficult via text prompts alone. Generating an image that accurately matches our mental imagery often requires numerous trial-and-error cycles of editing a prompt, inspecting the resulting images and then re-editing the prompt.

然而，文本到图像模型在对图像空间构图的控制方面存在局限性：仅通过文本提示很难精确表达复杂的布局、姿势、形状和形态。要生成与我们心中所想精确匹配的图像，通常需要经过多次反复试错的过程：编辑提示词、检查生成的图像，然后再次编辑提示词。

Can we enable finer grained spatial control by letting users provide additional images that directly specify their desired image composition? In computer vision and machine learning, these additional images (e.g., edge maps, human pose skeletons, segmentation maps, depth, normals, etc.) are often treated as conditioning on the image generation process.

我们能否通过允许用户提供直接指定其所需图像构成的额外图像，从而实现更精细的空间控制？在计算机视觉和机器学习中，这些额外的图像（例如边缘图、人体姿态骨骼、分割图、深度图、法线图等）通常被视为图像生成过程的条件。
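作为一个直观的例子，下面的示意代码展示了“额外的条件图像”（这里以 Canny 边缘图为例）如何与文本提示一起输入到预训练的文本到图像扩散模型中。这是基于 Hugging Face diffusers 库常见用法的假设性草图，模型名称与接口细节以实际版本的文档为准，示例中的边缘图 URL 仅为占位。

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# 加载一个以 Canny 边缘图为条件的 ControlNet，以及它所控制的 Stable Diffusion 主干
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# 条件图像：一张已经提取好的 Canny 边缘图（此处 URL 仅为示意占位）
edge_map = load_image("https://example.com/canny_edges.png")

# 文本提示负责语义内容，条件图像负责空间构图
image = pipe("a futuristic city at sunset", image=edge_map, num_inference_steps=30).images[0]
image.save("controlled_output.png")
```

可以看到，文本提示与条件图像各司其职：前者描述“画什么”，后者约束“画在哪、什么形状”。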
Image-to-image translation models [34, 98] learn the mapping from conditioning images to target images. The research community has also taken steps to control text-to-image models with spatial masks [6, 20], image editing instructions [10], personalization via finetuning [21, 75], etc.

图像到图像翻译模型 [34, 98] 学习从条件图像到目标图像的映射。研究社区也已采取措施，通过空间掩码 [6, 20]、图像编辑指令 [10]、基于微调的个性化 [21, 75] 等方式来控制文本到图像模型。

While a few problems (e.g., generating image variations, inpainting) can be resolved with training-free techniques like constraining the denoising diffusion process or editing attention layer activations, a wider variety of problems like depth-to-image, pose-to-image, etc., require end-to-end learning and data-driven solutions.

虽然少数问题（例如生成图像变体、图像修复）可以用免训练的技术解决，比如约束去噪扩散过程或编辑注意力层的激活，但更广泛的一类问题，如深度图到图像、姿态到图像等，则需要端到端的学习和数据驱动的解决方案。

注：
spatially localized：空间上局部化的 → 意思是指这种输入条件只影响图像的特定区域（比如只对图像左边或某个物体位置施加控制），而不是全局。
conduct：进行。

2. 相关工作

2.1. 微调神经网络

One way to finetune a neural network is to directly continue training it with the additional training data. But this approach can lead to overfitting, mode collapse, and catastrophic forgetting. Extensive research has focused on developing finetuning strategies that avoid such issues.

微调神经网络的一种方法是直接使用额外的训练数据继续训练它。但是这种方法可能导致过拟合、模式崩溃和灾难性遗忘。大量的研究集中在开发避免这些问题的微调策略上。

HyperNetwork is an approach that originated in the Natural Language Processing (NLP) community [25], with the aim of training a small recurrent neural network to influence the weights of a larger one. It has been applied to image generation with generative adversarial networks (GANs) [4, 18].

超网络（HyperNetwork）是一种起源于自然语言处理（NLP）社区的方法 [25]，旨在训练一个小型循环神经网络来影响一个大型神经网络的权重。它已被应用于生成对抗网络（GANs）的图像生成中 [4, 18]。

[25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.

Heathen et al. [26] and Kurumuz [43] implement HyperNetworks for Stable Diffusion [72] to change the artistic style of its output images.

Heathen 等人 [26] 和 Kurumuz [43] 为 Stable Diffusion [72] 实现超网络（HyperNetworks），以改变其输出图像的艺术风格。

Adapter methods are widely used in NLP for customizing a pretrained transformer model to other tasks by embedding new module layers into it [30, 84]. In computer vision, adapters are used for incremental learning [74] and domain adaptation [70]. This technique is often used with CLIP [66] for transferring pretrained backbone models to different tasks [23, 66, 85, 94].

在自然语言处理中，适配器方法通过将新的模块层嵌入预训练的 Transformer 模型中，广泛用于将该模型定制到其他任务 [30, 84]。在计算机视觉领域，适配器被用于增量学习 [74] 和领域自适应 [70]。该技术常与 CLIP [66] 结合使用，将预训练的骨干模型迁移到不同任务 [23, 66, 85, 94]。

More recently, adapters have yielded successful results in vision transformers [49, 50] and ViT-Adapter [14]. In concurrent work with ours, T2I-Adapter [56] adapts Stable Diffusion to external conditions.

近期，适配器在视觉 Transformer [49, 50] 和 ViT-Adapter [14] 中取得了成功。在与我们同期进行的工作中，T2I-Adapter [56] 将 Stable Diffusion 适配到外部条件。

[14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. International Conference on Learning Representations, 2023.

Additive Learning circumvents forgetting by freezing the original model weights and adding a small number of new parameters using learned weight masks [51, 74], pruning [52], or hard attention [80]. Side-Tuning [92] uses a side branch model to learn extra functionality by linearly blending the outputs of a frozen model and an added network, with a predefined blending weight schedule.

加法式学习（Additive Learning）通过冻结原始模型权重，并使用学习到的权重掩码 [51, 74]、剪枝 [52] 或硬注意力 [80] 来添加少量新参数，从而避免遗忘。Side-Tuning [92] 用一个侧分支模型来学习额外功能：它把冻结的原模型输出和新增的侧分支输出按一定权重线性混合，其中混合权重按照预设的计划变化（比如训练初期原模型占比大，后期侧分支占比大，参见下方参考文献之后的示意代码）。

注：
Additive Learning：一种方法的名字，“加法式学习”。
circumvents forgetting：绕过/避免遗忘 → 也就是解决“灾难性遗忘”问题（学新任务时把旧任务忘了）。
learned weight masks：学出来的权重掩码 → 决定新参数的哪些位置被激活，像一个可学习的开关。
pruning：剪枝 → 通常指删掉不重要的参数，但在这里的语境下，可能是先剪掉一些原模型的参数，然后在空出来的位置上加新参数，或者理解为只训练一部分稀疏的新连接。
hard attention：硬注意力 → 一种二进制的注意力，要么完全保留某个神经元，要么完全忽略，用来决定新参数加在哪儿。

[51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), pages 67–82, 2018.
[74] Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651–663, 2018.
[52] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[80] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
[92] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J. Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), pages 698–714. Springer, 2020.
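下面给出 Side-Tuning 式线性混合的一个极简示意代码。这是根据上文描述所写的假设性草图（PyTorch），并非 [92] 的官方实现；SideTuningWrapper 以及 alpha 的线性衰减计划都是为说明而设的。

```python
import torch
import torch.nn as nn

class SideTuningWrapper(nn.Module):
    """Side-Tuning 线性混合的示意：y = alpha * base(x) + (1 - alpha) * side(x)。
    base 模型被冻结，只训练 side 分支；alpha 按预设计划从 1 衰减到 0。（假设性草图）"""
    def __init__(self, base: nn.Module, side: nn.Module):
        super().__init__()
        self.base = base
        self.side = side
        for p in self.base.parameters():   # 冻结原模型权重，避免灾难性遗忘
            p.requires_grad = False

    def forward(self, x: torch.Tensor, alpha: float) -> torch.Tensor:
        with torch.no_grad():              # 原模型只做前向，不回传梯度
            y_base = self.base(x)
        y_side = self.side(x)              # 只有侧分支参与训练
        return alpha * y_base + (1.0 - alpha) * y_side

# 用法示意：alpha 随训练步数线性地从 1 减小到 0（预设的混合权重计划）
model = SideTuningWrapper(nn.Linear(16, 16), nn.Linear(16, 16))
total_steps = 1000
for step in range(total_steps):
    alpha = 1.0 - step / total_steps
    y = model(torch.randn(4, 16), alpha)
```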
Low-Rank Adaptation (LoRA) prevents catastrophic forgetting [31] by learning the offset of parameters with low-rank matrices, based on the observation that many overparameterized models reside in a low intrinsic dimension subspace [2, 47].

低秩适应（LoRA）通过用低秩矩阵学习参数的偏移量来防止灾难性遗忘 [31]，这是基于“许多过参数化的模型存在于低内在维度子空间中”这一观察结果 [2, 47]。

注：
offset of parameters：参数的偏移量。即“原参数应该调整多少”，而不是直接替换原参数。
low-rank matrices：低秩矩阵。一种数学上紧凑的表示方式，参数量很少。
→ LoRA 通过使用低秩矩阵来学习参数的调整量，而不是直接改原参数，从而防止遗忘。

为什么这样做能防止遗忘？因为 LoRA 不修改原模型：原模型的参数被冻结（frozen），一直保留着旧知识；新知识被存在那些“低秩矩阵”里，学完之后再加到原参数上。原参数始终不变 → 旧知识永远在 → 不会遗忘。

为什么可以用低秩矩阵？
overparameterized models：过参数化的模型，参数量远多于必要量的模型，比如大语言模型、扩散模型。
low intrinsic dimension subspace：低内在维度的子空间。通俗解释：虽然模型有几十亿参数，但真正有效的“自由度”其实很少。好比一个高维数据（比如人脸照片），实际上可以用很少的几个特征（眼睛间距、鼻子形状等）来近似表示。
observation：研究者观察到的现象。
→ 既然模型真正有效的变化空间很小，那么“参数的调整量”也可以在这个小空间里完成，用低秩矩阵就足够表达想要的更新。
reside：“存在于”、“位于”，更通俗的说法是“处于……空间之中”。它表示“这些模型的参数虽然处在高维空间，但它们的有效性实际上局限在一个低维子空间里”。

完整通俗翻译：“LoRA 通过低秩矩阵来学习原模型参数的调整值，以此防止灾难性遗忘。这样做的基础是：很多参数量巨大的模型，实际上内在的有效维度很低——也就是说，你只需要在很小的子空间里调整参数，就能达到不错的微调效果。”

更口语化：“LoRA 不直接改原模型的参数，而是另外加两个小矩阵（低秩矩阵）来记录‘原参数要改多少’。因为很多大模型其实‘真实需要的自由度’很少，所以用这些小矩阵就能学得很好，这样既不会忘记旧知识，又省显存。”

一个直观类比（帮助你真正“理解”）：
想象你有一本写满知识的大百科全书（原模型）。
全参数微调：直接在原书上修改字句。可能改着改着，旧的内容就被覆盖或删掉了（遗忘）。
LoRA：你在书旁边贴便利贴，便利贴上只写“改动差异”，比如“第3页第五行改为……”。便利贴很小（低秩），而且原书内容完好无损。阅读时，你同时看原书 + 便利贴上的改动就可以了。
为什么便利贴够用？因为原书已经很好了，你只需要在一个很小的改动范围内修正它对特定任务的表现，不需要重写整本书。

[31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319–7328, Online, Aug. 2021. Association for Computational Linguistics.
[47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations, 2018.
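下面是 LoRA 思想的一个极简示意代码。这是根据上文描述所写的假设性草图（PyTorch），并非 [31] 的官方实现；LoRALinear、r、alpha 等命名仅为示意：冻结原有权重 W，另外学习两个低秩矩阵 A、B，前向时输出为 W x + (B A) x，训练时只更新 A 和 B。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """在一个冻结的 nn.Linear 外面套一层低秩偏移：y = W x + b + (alpha/r) * B(A(x))。
    只有 A、B 参与训练，原权重 W 保持不变。（假设性示意，非官方实现）"""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # 冻结原模型参数，保留旧知识
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # 降维：d -> r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # 升维：r -> d
        nn.init.normal_(self.lora_A.weight, std=0.01)
        nn.init.zeros_(self.lora_B.weight)        # B 置零：训练开始时偏移量为 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# 用法示意：把某个预训练层替换成带 LoRA 的版本，只训练新增的低秩参数
layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(2, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # 新增参数量远小于 768*768
```

注意这里把 B 初始化为零，使训练开始时低秩偏移为 0，这与下文 ControlNet 的零初始化思想是相通的。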
Zero-Initialized Layers are used by ControlNet for connecting network blocks. Research on neural networks has extensively discussed the initialization and manipulation of network weights [36, 37, 44, 45, 46, 76, 83, 95]. For example, Gaussian initialization of weights can be less risky than initializing with zeros [1].

ControlNet 使用零初始化层来连接网络块。关于神经网络的研究已广泛讨论了网络权重的初始化和操作 [36, 37, 44, 45, 46, 76, 83, 95]。例如，与使用零初始化相比，高斯初始化风险较低 [1]。

注：
Zero-Initialized Layers：权重全部初始化为 0 的网络层，ControlNet 用它来连接不同的网络块。
→ ControlNet 在连接主模型和可训练副本时，用了把权重设为 0 的层。
→ 一般情况下，用高斯初始化比用零初始化更安全。

为什么零初始化有风险？因为如果一层所有权重都是 0，那么它的输出也是 0，反向传播时所有神经元的梯度相同，导致它们永远无法学到不同的特征（对称性问题）。所以通常不建议全零初始化。

这两者之间的“矛盾”怎么理解？一般人会觉得零初始化有风险，那 ControlNet 为什么还要用？这正是 ControlNet 的创意所在：ControlNet 的零初始化层并不是独立训练的层，而是作为连接桥梁。训练开始时，零初始化的层输出为 0 → 相当于“桥是断开的”，可训练副本对原模型完全没有影响；随着训练进行，权重从 0 逐渐被更新，控制信号从无到有、平滑引入。这样就不会在训练初期，用随机的、未学习的控制信号破坏预训练模型已经学好的特征。

所以，一般的风险是“所有神经元对称、学不动” → 但 ControlNet 恰恰希望一开始什么都不学（输出为 0），然后逐步介入。这是有意设计，不是缺陷。

一句话总结这段话的核心意思：虽然传统研究认为零初始化有风险、不如高斯初始化安全，但 ControlNet 却专门用零初始化来连接网络块——这样训练开始时新增部分为零，不会干扰预训练模型，之后才逐渐学会施加控制。

如果还不清楚，看这个类比：想象你在给一辆已经调校好的赛车加装一个辅助动力装置。
常见的零初始化问题：如果直接把辅助动力装置接上，但所有参数都是 0，那它一开始不工作，而且学起来很慢。
ControlNet 的做法：故意让连接点一开始是“断开”的（输出 0），等装好了再慢慢接通动力。这样不会在安装过程中搞坏原来的引擎。

More recently, Nichol et al. [59] discussed how to scale the initial weight of convolution layers in a diffusion model to improve the training, and their implementation of “zero module” is an extreme case to scale weights to zero. Stability’s model cards [83] also mention the use of zero weights in neural layers.

近期，Nichol 等人 [59] 讨论了如何缩放扩散模型中卷积层的初始权重以改进训练，他们实现的“零模块”是将权重缩放到零的极端情况。Stability 的模型卡 [83] 也提到了在神经网络层中使用零权重。

注：这段话的目的是为 ControlNet 使用“零初始化”提供背景支持和合理性论证，告诉读者“这种做法不是我们瞎搞的，别人也用过”。
scale the initial weight：缩放初始权重。即把原本随机初始化的权重乘以一个系数，比如 0.5、0.1、0.0 等。
“zero module”：他们实现的一种模块，名字就叫“零模块”。
extreme case to scale weights to zero：缩放权重的极端情况——直接缩到 0。
→ Nichol 等人的工作里，为了改善训练，会缩放卷积层的初始权重。其中一种极端做法就是把权重设为零，他们称之为“zero module”。
隐含逻辑：把权重设为零并不是一个疯狂的、不可行的操作，反而是“缩放权重”这个连续操作中的一种自然边界情况。ControlNet 就是用了这个“边界情况”。
→ Stability AI 的公开技术文档中也承认他们用过零权重。

Manipulating the initial convolution weights is also discussed in ProGAN [36], StyleGAN [37], and Noise2Noise [46].

在 ProGAN [36]、StyleGAN [37] 和 Noise2Noise [46] 中也讨论了对初始卷积权重的操作。
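最后给出“零卷积”连接方式的一个极简示意代码。这是按照上文对 ControlNet 做法的理解所写的假设性草图（PyTorch），并非官方实现；zero_conv、ControlledBlock 等均为示意命名，真实的 ControlNet 是把这一思想用在 Stable Diffusion U-Net 的各个编码块上。要点在于：连接层是权重和偏置都初始化为 0 的 1×1 卷积，训练开始时它的输出为 0，可训练分支对冻结主干没有任何影响；随着训练推进，权重从 0 逐渐更新，控制信号被平滑引入。

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 卷积，权重和偏置全部初始化为 0（“零卷积”的示意实现）。"""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """示意：冻结的主干 block + 可训练副本，用零卷积把控制信号加回主干输出。
    训练开始时 zero_out 输出为 0，整体等价于原模型；随训练逐渐引入控制。"""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():   # 锁定原模型权重
            p.requires_grad = False
        self.trainable_copy = trainable_copy       # 原 block 的可训练副本
        self.zero_in = zero_conv(channels)         # 注入条件 c 的零卷积
        self.zero_out = zero_conv(channels)        # 输出端的零卷积

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        y = self.frozen_block(x)                            # 原模型的输出
        control = self.trainable_copy(x + self.zero_in(c))  # 条件经零卷积注入副本
        return y + self.zero_out(control)                   # 初始时第二项恒为 0

# 用法示意：初始状态下带条件的输出与原模型完全一致（“桥是断开的”）
block = ControlledBlock(nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1), 8)
x, c = torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32)
assert torch.allclose(block(x, c), block.frozen_block(x))
```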