Four Data Cleaning Techniques to Improve Large Language Model (LLM) Performance -- Demo: Cleaning GAI Text Input

最编程 2024-04-09 16:40:52

...

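Let's put these techniques together in an example. In this demo, we use ChatGPT to generate a conversation between two technologists. We then apply basic cleaning techniques to that conversation to show how these practices lead to reliable, consistent results.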

synthetic_text = """
Sarah (S): Technology Enthusiast
Mark (M): AI Expert
S: Hey Mark! How's it going? Heard about the latest advancements in Generative AI (GA)?
M: Hey Sarah! Yes, I've been diving deep into the realm of GA lately. It's fascinating how it's shaping the future of technology!
S: Absolutely! I mean, GA has been making waves across various industries. What do you think is driving its significance?
M: Well, GA, especially Retrieval Augmented Generative (RAG), is revolutionizing content generation. It's not just about regurgitating information anymore; it's about creating contextually relevant and engaging content.
S: Right! And with Machine Learning (ML) becoming more sophisticated, the possibilities seem endless.
M: Exactly! With advancements in ML algorithms like GPT (Generative Pre-trained Transformer), we're seeing unprecedented levels of creativity in AI-generated content.
S: But what about concerns regarding bias and ethics in GA?
M: Ah, the age-old question! While it's true that GA can inadvertently perpetuate biases present in the training data, there are techniques like Adversarial Training (AT) that aim to mitigate such issues.
S: Interesting! So, where do you see GA headed in the next few years?
M: Well, I believe we'll witness a surge in applications leveraging GA for personalized experiences. From virtual assistants to content creation tools, GA will become ubiquitous in our daily lives.
S: That's exciting! Imagine AI-powered virtual companions tailored to our preferences.
M: Indeed! And with advancements in Natural Language Processing (NLP) and computer vision, these virtual companions will be more intuitive and lifelike than ever before.
S: I can't wait to see what the future holds!
M: Agreed! It's an exciting time to be in the field of AI.
S: Absolutely! Thanks for sharing your insights, Mark.
M: Anytime, Sarah. Let's keep pushing the boundaries of Generative AI together!
S: Definitely! Catch you later, Mark!
M: Take care, Sarah!
"""

Step 1: Basic cleaning

First, we remove emojis, hashtags, and Unicode characters from the conversation.

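import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup (assumed; resource names can vary by NLTK version):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

# Tokenization
tokens = word_tokenize(synthetic_text)

# Remove noise: strip every character that is not a word character or whitespace
cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]

# Drop tokens left empty once their punctuation was stripped
cleaned_tokens = [token for token in cleaned_tokens if token]

# Normalization (convert to lowercase)
cleaned_tokens = [token.lower() for token in cleaned_tokens]

# Remove stopwords
stop_words = set(stopwords.words('english'))
cleaned_tokens = [token for token in cleaned_tokens if token not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
cleaned_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

print(cleaned_tokens)

# Assumption: the cleaned context used in Step 4 is the tokens joined with spaces
new_content_string = ' '.join(cleaned_tokens)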
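The step description names emojis, hashtags, and Unicode characters specifically. As a minimal sketch of removing those three kinds of noise directly, the snippet below uses only the standard library. The sample string and the `basic_clean` helper are made up for illustration and are not part of the original demo. Note that the ASCII filter is deliberately blunt: it also drops accented letters, which may not be what you want for multilingual text.

```python
import re

# Hypothetical sample text; not taken from the article's conversation.
sample = "Loving the new #GenAI demo \U0001F680 caf\u00e9 rocks!"


def basic_clean(text: str) -> str:
    """Remove hashtags and non-ASCII characters, then tidy whitespace."""
    # Drop whole hashtags such as "#GenAI", not just the "#" symbol.
    text = re.sub(r"#\w+", "", text)
    # Drop every non-ASCII character (emojis, but also accented letters).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse the gaps left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(basic_clean(sample))
```

Running a pass like this before tokenization keeps stray symbols from turning into empty or meaningless tokens later in the pipeline.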

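Step 2: Prepare our prompt

Next, we craft a prompt that asks the model to respond as a friendly customer service agent, using only the information gathered from our synthetic conversation.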

# Triple quotes let the prompt span several lines without a syntax error
MESSAGE_SYSTEM_CONTENT = """You are a customer service agent that helps
a customer with answering questions. Please answer the question based on the
provided context below.
Make sure not to make any changes to the context if possible,
when prepare answers so as to provide accurate responses. If the answer
cannot be found in context, just politely say that you do not know,
do not try to make up an answer."""

Step 3: Prepare the interaction

Let's prepare the interaction with the model. In this example, we will use GPT-4.

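from openai import OpenAI

# Assumes the OpenAI Python SDK (v1+) with OPENAI_API_KEY set in the environment
client = OpenAI()

def response_test(question: str, context: str, model: str = "gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": MESSAGE_SYSTEM_CONTENT,
            },
            {"role": "user", "content": question},
            # The conversation is supplied as context in an assistant turn
            {"role": "assistant", "content": context},
        ],
    )

    return response.choices[0].message.content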
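Before spending API calls, it can help to inspect the payload offline. The sketch below rebuilds the same three-message structure as a plain list so the roles and content sizes can be checked without a network request. `build_messages` and `SYSTEM_PROMPT` are hypothetical names introduced here for illustration; in the demo, the system prompt would be `MESSAGE_SYSTEM_CONTENT`.

```python
from typing import Dict, List

# Placeholder prompt for illustration only; the demo's own system prompt would go here.
SYSTEM_PROMPT = "Answer only from the provided context."


def build_messages(question: str, context: str) -> List[Dict[str, str]]:
    """Return the chat payload in the order the demo sends it."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
        {"role": "assistant", "content": context},
    ]


messages = build_messages("What is RAG?", "S: ... M: ...")
for message in messages:
    # Print each role with the length of its content for a quick sanity check.
    print(message["role"], len(message["content"]))
```

Keeping payload construction separate from the API call makes it easy to confirm that the raw and cleaned contexts are the only thing that differs between the two runs.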

Step 4: Prepare the question

Finally, let's ask the model a question and compare the results before and after cleaning.

question1 = (
    "What are some specific techniques in Adversarial Training (AT) "
    "that can help mitigate biases in Generative AI models?"
)

Before cleaning, our model generates the following response:

response = response_test(question1, synthetic_text)
print(response)

# Output:
# I'm sorry, but the context provided doesn't contain specific techniques in Adversarial Training (AT) that can help mitigate biases in Generative AI models.

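After cleaning, the model generates the following response. With the context cleaned, the model gives a more thorough answer: it recognizes that the conversation names Adversarial Training as a mitigation technique, while still noting that no specific techniques are described.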

response = response_test(question1, new_content_string)
print(response)
# Output:
# The context mentions Adversarial Training (AT) as a technique that can
# help mitigate biases in Generative AI models. However, it does not provide
# any specific techniques within Adversarial Training itself.
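To make the before-and-after comparison more concrete, one rough check is how much the context shrinks after cleaning. The sketch below compares one raw line of the conversation with a hand-written approximation of its cleaned form. The `cleaned` string and the `describe` helper are assumptions for illustration, not output captured from the pipeline, and whitespace-separated words are only a loose proxy for model tokens.

```python
# One raw line from the conversation and a hand-written approximation of its
# cleaned form (lowercased, punctuation and stopwords removed).
raw = (
    "M: Well, GA, especially Retrieval Augmented Generative (RAG), "
    "is revolutionizing content generation."
)
cleaned = (
    "well ga especially retrieval augmented generative rag "
    "revolutionizing content generation"
)


def describe(text: str) -> dict:
    """Return simple size statistics for a piece of context."""
    return {"chars": len(text), "words": len(text.split())}


print("raw:    ", describe(raw))
print("cleaned:", describe(cleaned))
```

A smaller context is not automatically a better one, so a size check like this is best read alongside the model's actual answers rather than in place of them.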
