This article explores how large language models (LLMs) are changing the way we build AI-powered products and the landscape of machine learning operations (MLOps).

OpenAI’s ChatGPT has opened Pandora’s box of large language models (LLMs) in production. Not only does your neighbor now bother you with small talk about artificial intelligence, but the machine learning (ML) community is talking about yet another new term: “LLMOps.”

LLMs are changing the way we build and maintain AI-powered products. This will lead to new sets of tools and best practices for the lifecycle of LLM-powered applications.

This article will first explain the newly emerged term “LLMOps” and its background. We’ll discuss how building AI products with LLMs differs from building them with classical ML models. Based on these differences, we’ll look at how MLOps differs from LLMOps. Finally, we’ll discuss what developments we can expect in the LLMOps space in the near future.


The term LLMOps stands for Large Language Model Operations. The short definition is that LLMOps is MLOps for LLMs. That means that LLMOps is, essentially, a new set of tools and best practices to manage the lifecycle of LLM-powered applications, including development, deployment, and maintenance.

When we say that “LLMOps is MLOps for LLMs”, we need to define the terms LLMs and MLOps first:

  • LLMs (large language models) are deep learning models that can generate outputs in human language (and are thus called language models). The models have billions of parameters and are trained on billions of words (and are thus called large language models).

  • MLOps (machine learning operations) is a set of tools and best practices to manage the lifecycle of ML-powered applications.

With that out of the way, let’s dig in a bit further.


Early LLMs like BERT and GPT-2 have been around since 2018 and 2019, respectively. Yet only now, almost five years later, are we experiencing a meteoric rise of the idea of LLMOps. The main reason is that LLMs gained much media attention with the release of ChatGPT at the end of November 2022.

Since then, we have seen many different applications leveraging the power of LLMs, such as:

  • Chatbots, ranging from the famous ChatGPT to more intimate and personal ones (e.g., Michelle Huang chatting with her childhood self),

  • Writing assistants, ranging from general ones for editing or summarization (e.g., Notion AI) to specialized ones for copywriting (e.g., Jasper and copy.ai) or contracting (e.g., lexion),

  • Programming assistants, ranging from writing and debugging code (e.g., GitHub Copilot), to testing it (e.g., Codium AI), to finding security threats (e.g., Socket AI).

With many people developing and bringing LLM-powered applications to production, people are sharing their experiences:

“It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.” - Chip Huyen [2]

It’s become clear that building production-ready LLM-powered applications comes with its own set of challenges, different from building AI products with classical ML models. To tackle these challenges, we need to develop new tools and best practices to manage the LLM application lifecycle. Thus, we see an increased use of the term “LLMOps.”

What Steps Are Involved in LLMOps?

The steps involved in LLMOps are in some ways similar to MLOps. However, the steps of building an LLM-powered application differ due to the emergence of foundation models. Instead of training LLMs from scratch, the focus lies on adapting pre-trained LLMs to downstream tasks.

Already over a year ago, Andrej Karpathy [3] described how the process of building AI products will change in the future:

“But the most important trend […] is that the whole setting of training a neural network from scratch on some target task […] is quickly becoming outdated due to finetuning, especially with the emergence of foundation models like GPT. These foundation models are trained by only a few institutions with substantial computing resources, and most applications are achieved via lightweight finetuning of part of the network, prompt engineering, or an optional step of data or model distillation into smaller, special-purpose inference networks. […]” - Andrej Karpathy [3]

This quote may be overwhelming the first time you read it. But it summarizes everything that has been going on precisely, so let’s unpack it step by step in the following subsections.

Step 1: Selection of a foundation model

Foundation models are LLMs pre-trained on large amounts of data that can be used for a wide range of downstream tasks. Because training a foundation model from scratch is complicated, time-consuming, and extremely expensive, only a few institutions have the required training resources [3].

To put this into perspective: according to a 2020 estimate from Lambda Labs, training OpenAI’s GPT-3 (with 175 billion parameters) would take 355 years and cost $4.6 million on a Tesla V100 cloud instance.

AI is currently going through what the community is calling its “Linux moment”. Developers have to choose between two types of foundation models, proprietary and open-source, based on a trade-off between performance, cost, ease of use, and flexibility.

Proprietary and open-source foundation models (Image by the author, inspired by Fiddler.ai)

Proprietary models are closed-source foundation models owned by companies with large expert teams and big AI budgets. They are usually larger than open-source models and perform better. They are also available off the shelf and generally rather easy to use.

The main downside of proprietary models is their expensive APIs (application programming interfaces). Additionally, closed-source foundation models offer developers little or no flexibility to adapt them.

Examples of proprietary model providers are:

Open-source models are often organized and hosted on HuggingFace as a community hub. Usually, they are smaller models with lower capabilities than proprietary models. But on the upside, they are more cost-effective than proprietary models and offer more flexibility for developers.

Examples of open-source models are:

Step 2: Adaptation to downstream tasks

Once you have chosen your foundation model, you can access the LLM through its API. If you are used to working with other APIs, working with LLM APIs will initially feel a little strange because it is not always clear beforehand which input will produce which output. Given any text prompt, the API will return a text completion, attempting to match your pattern.

Here is an example of how you would use the OpenAI API. You give the API input as a prompt, e.g., prompt = "Correct this to standard English:\n\nShe no went to the market."

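import openai

openai.api_key = ...  # set your API key here

# Legacy Completions endpoint of the openai Python package (versions before 1.0)
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Correct this to standard English:\n\nShe no went to the market.",
    # ...
)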

The API will output a response containing the completion: response["choices"][0]["text"] = "She did not go to the market."

The main challenge is that LLMs, despite being powerful, are not almighty. Thus, the key question is: how do you get an LLM to give the output you want?

One concern respondents mentioned in the LLM in production survey [4] was model accuracy and hallucinations. That means getting the output from the LLM API in your desired format might take some iterations, and also, LLMs can hallucinate if they don’t have the required specific knowledge. To combat these concerns, you can adapt the foundation models to downstream tasks in the following ways:

  • Prompt Engineering [2, 3, 5] is a technique to tweak the input so that the output matches your expectations. You can use different tricks to improve your prompt (see OpenAI Cookbook). One method is to provide some examples of the expected output format. This is similar to a zero-shot or few-shot learning setting [5]. Tools like LangChain or HoneyHive have already emerged to help you manage and version your prompt templates [1].

Prompt engineering (Image by the author, inspired by Chip Huyen [2])
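As a rough illustration of providing examples of the expected output format, the sketch below assembles a few-shot prompt as a plain string. The helper name `build_prompt`, the "Input:/Output:" layout, and the example sentences are assumptions made for illustration; they are not part of any particular library or of the OpenAI API.

```python
# Hypothetical few-shot examples; in practice these would be hand-picked.
FEW_SHOT_EXAMPLES = [
    ("He go to school yesterday.", "He went to school yesterday."),
    ("They was happy.", "They were happy."),
]


def build_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Build a prompt: instruction, worked examples, then the new input."""
    lines = ["Correct this to standard English."]
    for original, corrected in examples:
        lines.append(f"\nInput: {original}\nOutput: {corrected}")
    # Leave the final "Output:" open so the model completes it.
    lines.append(f"\nInput: {text}\nOutput:")
    return "\n".join(lines)


prompt = build_prompt("She no went to the market.")
print(prompt)
```

Passing `examples=[]` yields the zero-shot variant of the same prompt, which makes it easy to compare both settings on the same inputs.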

  • Fine-tuning pre-trained models [2, 3, 5] is a known technique in ML. It can help improve your model’s performance on your specific task. Although this will increase the training effort, it can reduce the cost of inference. The cost of LLM APIs depends on input and output sequence length. Because a fine-tuned model no longer needs examples in the prompt, the number of input tokens drops, and with it the API cost [2].

Fine-tuning LLMs (Image by the author inspired by Chip Huyen [2])

  • External data: Foundation models often lack contextual information (e.g., access to some specific documents or emails) and can become outdated quickly (e.g., GPT-4 was trained on data before September 2021). Because LLMs can hallucinate if they don’t have sufficient information, we need to be able to give them access to relevant external data. Tools such as LlamaIndex (GPT Index), LangChain, or DUST are already available and can act as central interfaces for connecting (“chaining”) LLMs to other agents and external data [1].

  • Embeddings: Another way is to extract information in the form of embeddings from LLM APIs (e.g., movie summaries or product descriptions) and build applications on top of them (e.g., search, comparison, or recommendations). If np.array is not sufficient to store your embeddings for long-term memory, you can use vector databases such as Pinecone, Weaviate, or Milvus [1].

  • Alternatives: As this field is rapidly evolving, there are many more ways LLMs can be leveraged in AI products. Some examples are instruction tuning/prompt tuning and model distillation [2, 3].
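To make the embeddings approach above more concrete, here is a minimal sketch of a similarity search over stored vectors using only the standard library. The three-dimensional vectors are toy values standing in for real embeddings returned by an LLM API, and the names `movie_index` and `most_similar` are hypothetical. A vector database performs the same kind of lookup at a much larger scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vec, index):
    """Return the key in `index` whose vector is closest to `query_vec`."""
    return max(index, key=lambda name: cosine_similarity(query_vec, index[name]))


# Toy 3-dimensional vectors standing in for real embeddings (assumed values).
movie_index = {
    "space adventure": [0.9, 0.1, 0.0],
    "romantic comedy": [0.1, 0.9, 0.2],
    "courtroom drama": [0.0, 0.3, 0.9],
}

query = [0.8, 0.2, 0.1]  # hypothetical embedding of "a film about astronauts"
print(most_similar(query, movie_index))
```

The same lookup can also serve the external-data case: the best-matching document is retrieved first and then pasted into the prompt as context.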

Step 3: Evaluation

In classical MLOps, ML models are validated on a hold-out validation set [5] with a metric that indicates the models’ performance. But how do you evaluate the performance of an LLM? How do you decide whether a response is good or bad? Currently, it seems like organizations are A/B testing their models [5].

To help evaluate LLMs, tools like HoneyHive or HumanLoop have emerged.
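A minimal sketch of what A/B testing two prompt or model variants could look like, assuming user feedback arrives as simple thumbs-up/thumbs-down ratings. The function names and the feedback log are made up for illustration and do not reflect how HoneyHive or HumanLoop work internally.

```python
import hashlib


def assign_variant(user_id, variants=("A", "B")):
    """Deterministically map a user to a variant so they always see the same one."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def approval_rates(feedback):
    """Compute the share of positive ratings per variant.

    `feedback` is a list of (variant, liked) pairs, where `liked` is a bool.
    """
    totals, likes = {}, {}
    for variant, liked in feedback:
        totals[variant] = totals.get(variant, 0) + 1
        likes[variant] = likes.get(variant, 0) + int(liked)
    return {v: likes[v] / totals[v] for v in totals}


# Made-up feedback log for illustration only.
feedback_log = [("A", True), ("A", False), ("B", True), ("B", True)]
print(approval_rates(feedback_log))
```

Hashing the user ID, rather than picking a variant at random per request, keeps each user's experience consistent across sessions. A real comparison would need far more ratings than this toy log before drawing conclusions.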


Step 4: Deployment and Monitoring

The completions of LLMs can drastically change between releases [2]. For example, OpenAI has updated its models to mitigate the generation of inappropriate content, e.g., hate speech. As a result, searching for the phrase “as an AI language model” on Twitter now reveals countless bots.

This shows that building LLM-powered applications requires monitoring changes in the underlying API model.

Tools for monitoring LLMs, such as Whylabs or HumanLoop, are already emerging.
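One very simple way to notice such a change is to track how often completions contain tell-tale phrases and compare the rate between time windows. The sketch below assumes you log completions yourself; the phrase list and the tolerance are arbitrary illustrative choices, not a recommendation from any of the tools mentioned.

```python
# Phrases that often signal a refusal or boilerplate answer (assumed list).
MARKER_PHRASES = ("as an ai language model", "i'm sorry, but")


def flagged_rate(completions, markers=MARKER_PHRASES):
    """Fraction of completions containing at least one marker phrase."""
    if not completions:
        return 0.0
    flagged = sum(
        any(marker in text.lower() for marker in markers) for text in completions
    )
    return flagged / len(completions)


def has_drifted(baseline_rate, current_rate, tolerance=0.05):
    """Flag a change when the rate moves by more than `tolerance` (assumed threshold)."""
    return abs(current_rate - baseline_rate) > tolerance


last_week = ["She did not go to the market.", "He went home."]
this_week = ["As an AI language model, I cannot help with that.", "He went home."]
print(has_drifted(flagged_rate(last_week), flagged_rate(this_week)))
```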

How Is LLMOps Different Than MLOps?

The differences between MLOps and LLMOps are caused by the differences in how we build AI products with classical ML models versus LLMs. The differences mainly affect data management, experimentation, evaluation, cost, and latency.

Data Management

In classical MLOps, we are used to data-hungry ML models. Training a neural network from scratch requires a lot of labeled data, and even fine-tuning a pre-trained model requires at least a few hundred samples. Although data cleaning is integral to the ML development process, we know and accept that large datasets have imperfections.

In LLMOps, fine-tuning is similar to MLOps. But prompt engineering is a zero-shot or few-shot learning setting. That means we have few but hand-picked samples [5].

Experimentation

In MLOps, experimentation looks similar whether you train a model from scratch or fine-tune a pre-trained one. In both cases, you will track inputs, such as model architecture, hyperparameters, and data augmentations, and outputs, such as metrics.

But in LLMOps, the question is whether to engineer prompts or to fine-tune [2, 5]. Although fine-tuning looks similar in LLMOps and MLOps, prompt engineering requires a different experimentation setup, including the management of prompts.
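A minimal sketch of what managing prompts in an experiment log might look like, assuming the prompt template is tracked like any other hyperparameter. The record layout, the helper names, and the metric values are hypothetical and only serve to illustrate the idea.

```python
import hashlib


def prompt_version(template):
    """Short, stable identifier derived from the template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]


def log_run(runs, template, model, metric):
    """Append one experiment record; the prompt is tracked like a hyperparameter."""
    runs.append(
        {
            "prompt_version": prompt_version(template),
            "template": template,
            "model": model,
            "metric": metric,
        }
    )


runs = []
# The metric values below are made up for illustration.
log_run(runs, "Correct this to standard English:\n\n{text}", "text-davinci-003", 0.81)
log_run(runs, "Fix the grammar of this sentence:\n\n{text}", "text-davinci-003", 0.77)

best = max(runs, key=lambda run: run["metric"])
print(best["prompt_version"], best["metric"])
```

Deriving the version from the template text means that any edit to the prompt, however small, shows up as a new version in the log.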

Evaluation

In classical MLOps, a model’s performance is evaluated on a hold-out validation set [5] with an evaluation metric. Because the performance of LLMs is more difficult to evaluate, currently organizations seem to be using A/B testing [5].


Cost

While the cost of traditional MLOps usually lies in data collection and model training, the cost of LLMOps lies in inference [2]. Although we can expect some costs from using expensive APIs during experimentation [5], Chip Huyen [2] shows that long prompts drive up cost mainly at inference time.
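As a back-of-the-envelope illustration of why prompt length matters for inference cost, the sketch below compares a few-shot prompt with a shorter prompt for a fine-tuned model. The price, token counts, and traffic are assumed values, and the sketch further assumes that input and output tokens are billed at the same rate and that both models cost the same per token, which may not hold for a real provider.

```python
def request_cost(input_tokens, output_tokens, price_per_1k_tokens):
    """Cost of one API call when input and output tokens share one rate."""
    return (input_tokens + output_tokens) / 1000 * price_per_1k_tokens


PRICE = 0.02               # assumed price in USD per 1,000 tokens
REQUESTS_PER_DAY = 10_000  # assumed traffic

# Few-shot prompt: 400 tokens of examples + 50 tokens of actual input.
few_shot = request_cost(450, 50, PRICE)
# Fine-tuned model: the examples are no longer needed in the prompt.
fine_tuned = request_cost(50, 50, PRICE)

print(f"few-shot:   ${few_shot * REQUESTS_PER_DAY:.2f} per day")
print(f"fine-tuned: ${fine_tuned * REQUESTS_PER_DAY:.2f} per day")
```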


Latency

Another concern respondents mentioned in the LLM in production survey [4] was latency. The completion length of an LLM significantly affects latency [2]. Although latency has to be considered in MLOps as well, it is much more prominent in LLMOps because it strongly affects both experimentation velocity during development [5] and the user experience in production.
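To get a feel for how completion length affects latency, the following sketch uses a deliberately simple linear model: a fixed overhead plus a constant time per generated token. Both the linear form and the default numbers are assumptions for illustration rather than measurements of any real API.

```python
def estimated_latency(output_tokens, overhead_s=0.5, seconds_per_token=0.03):
    """Rough latency model: fixed overhead plus time per generated token.

    Both default values are assumptions for illustration, not measurements.
    """
    return overhead_s + output_tokens * seconds_per_token


for n in (50, 200, 800):
    print(f"{n:>4} output tokens -> ~{estimated_latency(n):.1f} s")
```

Under this model, capping the completion length is the most direct lever for reducing response time.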

The Future of LLMOps

LLMOps is an emerging field. With the speed at which this space is evolving, making any predictions is difficult. It is even uncertain whether the term “LLMOps” is here to stay. We are only sure that we will see many new use cases for LLMs, along with new tools and best practices to manage the LLM lifecycle.

The field of AI is rapidly evolving, potentially making anything we write now outdated in a month. We’re still in the early stages of bringing LLM-powered applications to production. There are many questions we don’t have the answers to, and only time will tell how things will play out:

  • Is the term “LLMOps” here to stay?

  • How will LLMOps evolve in light of MLOps? Will they merge, or will they become separate sets of operations?

  • How will AI’s “Linux moment” play out?

We can say with certainty that we expect to see many developments, new tools, and best practices appear soon. We are also already seeing efforts to reduce the cost and latency of foundation models [2]. These are definitely interesting times!


Since the release of OpenAI’s ChatGPT, LLMs have been a hot topic in the field of AI. These deep learning models can generate outputs in human language, making them a powerful tool for tasks such as conversational AI, writing assistants, and programming assistants.

However, bringing LLM-powered applications to production presents its own set of challenges, which has led to the emergence of a new term, “LLMOps”. It refers to the set of tools and best practices used to manage the lifecycle of LLM-powered applications, including development, deployment, and maintenance.

LLMOps can be seen as a subcategory of MLOps. However, the steps involved in building an LLM-powered application differ from those in building applications with classical ML models.

Rather than training an LLM from scratch, the focus is on adapting pre-trained LLMs to downstream tasks. This involves selecting a foundation model, using LLMs in downstream tasks, evaluating them, and deploying and monitoring the model.

While LLMOps is still a relatively new field, it is expected to continue to develop and evolve as LLMs become more prevalent in the AI industry. Overall, the rise of LLMs and LLMOps represents a significant shift in building and maintaining AI-powered products.





[1] D. Hershey and D. Oppenheimer (2023). DevTools for language models - predicting the future (accessed April 14th, 2023)

[2] C. Huyen (2023). Building LLM applications for production (accessed April 16th, 2023)

[3] A. Karpathy (2022). Deep Neural Nets: 33 years ago and 33 years from now (accessed April 17th, 2023)

[4] MLOps Community (2023). LLM in production responses (accessed April 19th, 2023)

[5] S. Shankar (2023). Twitter Thread (accessed April 14th, 2023)