DALLE 3 Technical Analysis - Training Method / Model Architecture


er_eh1cl, November 4, 2023, 11:12

<section id="nice" style="font-size: 16px; padding-right: 10px; padding-left: 10px; word-break: break-word; overflow-wrap: break-word; line-height: 1.25; font-family: Optima-Regular, Optima, PingFangTC-Light, PingFangSC-light, PingFangTC-light; letter-spacing: 2px; background-image: linear-gradient(90deg, rgba(50, 0, 0, 0.05) 3%, rgba(0, 0, 0, 0) 3%), linear-gradient(360deg, rgba(50, 0, 0, 0.05) 3%, rgba(0, 0, 0, 0) 3%); background-size: 20px 20px; background-position: center center;"><h1 style="margin-top: 30px; margin-bottom: 15px; font-weight: bold; font-size: 25px;"><span style="display: inline-block; color: rgb(119, 48, 152);">DALLE 3 Technical Analysis - Training Method / Model Architecture</span></h1>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">1. Introduction:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">From the technical track record of DALLE 3's developers and the model's demo videos, we can infer some architectural details of the DALLE 3 model.</p>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">2. Assessment of DALLE 2:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">DALLE 2's weak performance is mainly attributable to the limitations of the CLIP model.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">CLIP struggled to supply the downstream diffusion model with sufficiently rich content and detailed features.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">As a result, the model had significant difficulty generating detailed images.</p>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">3. 
The role of the GPT model:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">Earlier experiments used GPT 2 as the core processing system for audio/visual media, tasked with interpreting human text input and converting it into a visual representation for a diffusion model.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">That GPT 2-based model outperformed many of its contemporaries, which made the strategy look viable.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">For DALLE 3, it has not been established which GPT version serves as the autoregressive core, GPT 3 or GPT 4. For the purposes of this analysis, we assume GPT 4.</p>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">4. GPT 4's image understanding:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">Several months ago, GPT 4's image understanding had already improved markedly, but OpenAI did not release it to the public.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">From a business standpoint, OpenAI may not have had enough compute to offer image interpretation. That raises a question: where was the compute being directed?</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">With the release of DALLE 3, we speculate that GPT 4's image capability was used to generate training data suited to DALLE 3.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">The architecture of GPT 4's image model may follow an approach similar to BLIP2/mini GPT 4. That could include an additional visual encoder (ViT) and a few bridging layers (for example, Qformer) that convert the image into a format the model can understand.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">OpenAI's visual encoder/decoder was presumably trained in-house, which may yield better results.</p>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">5. 
Possible reason for the delayed release of GPT 4's image features:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">A possible reason the image-capable version of GPT 4 took so long to ship: the servers were being used to produce an image-text pair dataset.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">With ample data in hand, the way was naturally paved for building DALLE 3.</p>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">6. Hypothesized structure and training of DALLE 3:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">First, OpenAI trained an efficient visual encoder/decoder.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">After that, they may have used a method similar to miniGPT 4 to train GPT 4 for image processing.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">Once GPT 4 has image capability, it can generate a comprehensive image-text pair dataset. This may also take the form of image (image tokens) to text (text tokens) pairs.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">The pairs, read in the text tokens to image tokens direction, may then be used to train the main part of DALLE, which we provisionally call the “GPT 4 image creator”.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">The next step probably involves converting image tokens back into an image. Diffusion models currently excel at this task, even surpassing the native decoder.</p>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">We speculate that a diffusion decoder is used for image generation.</p>
<h2 style="font-weight: bold; font-size: 22px; margin-top: 20px; margin-right: 10px; margin-bottom: 0px;"><span style="font-size: 18px; display: inline-block; padding-left: 10px; border-left: 5px solid rgb(145, 109, 213);">7. 
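Closing remarks:</span></h2>
<p style="padding-top: 8px; padding-bottom: 8px; line-height: 26px; font-size: 14px; word-spacing: 2px;">This analysis avoids going deep into the model's finer details, such as whether there are residual connections between the models, or whether text tokens are fed into the diffusion model together with image tokens. Settling those details requires actual experiments. Also, because this post is already too long, some of the reasoning based on the video content is left unexplained for now. Author: 一点小小的 AI 震撼 https://www.bilibili.com/read/cv26647930/ Source: bilibili</p>
</section>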
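Section 4 guesses that a Qformer-style bridge sits between the visual encoder and the language model. The sketch below illustrates only the core idea behind such a bridge: a fixed set of query vectors cross-attends over a variable number of patch features, so the language model always receives the same number of visual tokens. It assumes single-head attention with no learned projections. The names `cross_attend`, `NUM_QUERIES` and `DIM` and all the random values are hypothetical, and nothing here reflects OpenAI's actual implementation or the full BLIP2 Qformer.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: list of NUM_QUERIES vectors (these would be learned in a real model)
    patches: list of patch feature vectors from a visual encoder
    Returns one output vector per query, plus the attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(d) for p in patches]
        w = softmax(scores)
        # Each output is a weighted average of the patch features.
        out = [sum(w[j] * patches[j][k] for j in range(len(patches))) for k in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8          # hypothetical feature width
NUM_QUERIES = 4  # hypothetical number of visual tokens handed to the LLM
queries = random_matrix(NUM_QUERIES, DIM, rng)

# Two "images" with different patch counts, standing in for ViT outputs.
small = random_matrix(16, DIM, rng)
large = random_matrix(64, DIM, rng)

tokens_small, w_small = cross_attend(queries, small)
tokens_large, w_large = cross_attend(queries, large)
print(len(tokens_small), len(tokens_large))  # prints: 4 4
```

Whatever the patch count, the output is always `NUM_QUERIES` vectors, which is why this kind of bridge is convenient in front of a language model with a fixed token budget.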

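The three-stage pipeline hypothesized in section 6 (text tokens, then an autoregressive "GPT 4 image creator" that emits image tokens, then a diffusion decoder) can be laid out as a data-flow sketch. Every stage below is a deterministic placeholder and not a real model: hashing stands in for the tokenizer and for next-token prediction, and a simple iterative refinement loop stands in for diffusion decoding. `IMAGE_VOCAB`, `TOKENS_PER_IMAGE` and all function names are assumptions made for illustration only.

```python
import hashlib
from typing import List

IMAGE_VOCAB = 1024     # hypothetical size of the image-token codebook
TOKENS_PER_IMAGE = 16  # hypothetical number of image tokens per picture


def tokenize_text(prompt: str) -> List[int]:
    """Placeholder tokenizer: one id per whitespace-separated word."""
    return [int(hashlib.sha256(w.encode()).hexdigest(), 16) % 50000
            for w in prompt.split()]


def image_creator(text_tokens: List[int]) -> List[int]:
    """Placeholder for the autoregressive core.

    Emits image tokens one at a time; each new token depends on the text
    tokens and on every image token emitted so far.
    """
    context = list(text_tokens)
    image_tokens = []
    for _ in range(TOKENS_PER_IMAGE):
        digest = hashlib.sha256(repr(context).encode()).hexdigest()
        next_token = int(digest, 16) % IMAGE_VOCAB
        image_tokens.append(next_token)
        context.append(next_token)  # autoregressive feedback
    return image_tokens


def diffusion_decoder(image_tokens: List[int], steps: int = 4) -> List[float]:
    """Placeholder decoder: iteratively refines a flat starting signal
    toward values conditioned on the image tokens. Not real diffusion."""
    target = [t / (IMAGE_VOCAB - 1) for t in image_tokens]
    pixels = [0.5] * len(target)
    for _ in range(steps):
        pixels = [p + 0.5 * (t - p) for p, t in zip(pixels, target)]
    return pixels


text_tokens = tokenize_text("a corgi wearing sunglasses")
image_tokens = image_creator(text_tokens)
pixels = diffusion_decoder(image_tokens)
print(len(text_tokens), len(image_tokens), len(pixels))  # prints: 4 16 16
```

The open questions from the closing remarks, such as whether text tokens are also fed to the decoder, would show up here as extra arguments to `diffusion_decoder`; the sketch deliberately leaves them out.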
