2024-03
LLMs are Good Action Recognizers
Categories: Computer Vision and Pattern Recognition
Authors: Haoxuan Qu, Yujun Cai, Jun Liu
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00532v1
Abstract: Skeleton-based action recognition has attracted much research attention. Recently, various works have been proposed to build accurate skeleton-based action recognizers. Among them, some use large model architectures as the recognizer backbone to improve the capacity of skeleton data representation, while others pre-train the recognizer on external data to enrich its knowledge. In this work, we observe that large language models, which are widely used across natural language processing tasks, typically possess both a large model architecture and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating a large language model as an action recognizer. In our framework, we propose a linguistic projection process that projects each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). We further equip the framework with several designs to facilitate this linguistic projection process. Extensive experiments demonstrate the effectiveness of our proposed framework.
DivTOD: Unleashing the Power of LLMs for Diversifying Task-Oriented Dialogue Representations
Categories: Computation and Language
Authors: Weihao Zeng, Dayuan Fu, Keqing He, Yejie Wang, Yukai Xu, Weiran Xu
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00557v1
Abstract: Language models pre-trained on general text have achieved impressive results across diverse domains. However, the distinct linguistic characteristics of task-oriented dialogue (TOD), compared with general text, limit the practical utility of existing language models. Current task-oriented dialogue pre-training methods overlook the one-to-many property of conversations, where multiple responses can be appropriate for the same dialogue context. In this paper, we propose a novel dialogue pre-training model named DivTOD, which collaborates with large language models to learn diverse task-oriented dialogue representations. DivTOD guides LLMs to transfer diverse knowledge into smaller models while removing domain knowledge that contradicts task-oriented dialogue. Experiments show that our model outperforms strong TOD baselines on various downstream dialogue tasks and learns the intrinsic diversity of task-oriented dialogue.
"My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents
Categories: Human-Computer Interaction, I.2.4; H.3.3
Authors: Yuki Hou, Haruki Tamoto, Homei Miyashita
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00573v1
Abstract: In this study, we propose a novel human-like memory architecture designed to enhance the cognitive abilities of dialogue agents based on large language models. The proposed architecture enables agents to autonomously recall the memories needed to generate responses, effectively addressing the limitations of LLMs in temporal cognition. We adopt human cue-based memory recall as the trigger for accurate and efficient memory retrieval. In addition, we develop a mathematical model that dynamically quantifies memory consolidation, taking into account factors such as contextual relevance, elapsed time, and recall frequency. The agent stores memories retrieved from the user interaction history in a database that encapsulates each memory's content and temporal context. This strategic storage allows the agent to recall specific memories and understand their significance to the user in temporal context, much as humans recognize and recall past experiences.
CHOPS: Chat with Customer Profile Systems for Customer Service with LLMs
Categories: Computation and Language, Artificial Intelligence
Authors: Jingzhe Shi, Jialuo Li, Qinwei Ma, Zaiwen Yang, Huan Ma, Lei Li
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.01343v1
Abstract: Businesses and software platforms are increasingly turning to large language models (LLMs) such as GPT-3.5, GPT-4, GLM-3, and LLaMa-2 to provide chat assistance with file access or to serve as reasoning agents for customer service. However, current LLM-based customer service models have limited integration with customer profiles and lack the operational capabilities needed for effective service. Moreover, existing API integrations emphasize diversity over precision and error avoidance, both of which are essential in real-world customer service scenarios. To address these issues, we propose an LLM agent named CHOPS (CHat with custOmer Profile in existing System), designed to: (1) efficiently utilize existing databases or systems to access user information and interact with these systems following existing guidelines; (2) provide accurate and reasonable responses or carry out required operations in the system while avoiding harmful operations; and (3) leverage a combination of small and large LLMs to achieve satisfactory performance at a reasonable inference cost. We introduce a practical dataset, the CPHOS dataset, which consists of a database, guideline files, and QA pairs collected from CPHOS, an online platform that facilitates the organization of simulated physics olympiads for high-school teachers and students. We conduct extensive experiments to validate the performance of our proposed CHOPS architecture on the CPHOS dataset, with the aim of demonstrating how LLMs can enhance or serve as alternatives to human customer service. Our code and dataset will be open-sourced soon.
The AI Act and Large Language Models (LLMs): When Critical Issues and Privacy Impact Require Human and Ethical Oversight
Categories: Computers and Society, Artificial Intelligence, Computation and Language
Authors: Nicola Fabiano
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00600v2
Abstract: The imposing development of artificial intelligence systems, and in particular of large language models (LLMs), makes it necessary to assess their level of risk and the impact they may have on privacy, personal data protection, and ethics, especially for the weakest and most vulnerable. This contribution addresses human oversight, ethical oversight, and privacy impact assessment.
Face It Yourselves: An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs
Categories: Software Engineering, Machine Learning
Authors: Shiwen Shan, Yintong Huo, Yuxin Su, Yichen Li, Dan Li, Zibin Zheng
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00640v2
Abstract: Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges to both experienced maintainers and new end users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end users, we conduct a preliminary study to outline the challenges and opportunities of utilizing logs to localize configuration errors. Based on the insights gained from the preliminary study, we propose an LLM-based two-stage strategy for end users to localize the root-cause configuration properties from logs. We further implement a tool, LogConfigLocalizer, aligned with the design of the aforementioned strategy, hoping to assist end users in coping with configuration errors through log analysis. To the best of our knowledge, this is the first work to localize root-cause configuration properties for end users based on large language models (LLMs) and logs. We evaluate the proposed strategy on Hadoop with LogConfigLocalizer and demonstrate its efficiency, with an average accuracy as high as 99.91%. In addition, we demonstrate the effectiveness and necessity of the different phases of the method by comparing it with two variants and a baseline tool. Moreover, we validate the proposed approach through a practical case study to demonstrate its effectiveness and feasibility.
LLM meets Vision-Language Models for Zero-Shot One-Class Classification
Categories: Computer Vision and Pattern Recognition, Artificial Intelligence
Authors: Yassir Bendou, Giulia Lioi, Bastien Pasdeloup, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Vincent Gripon
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00675v2
Abstract: We consider the problem of zero-shot one-class visual classification. In this setting, only the label of the target class is available, and the goal is to discriminate between positive and negative query samples without requiring any validation example from the target task. We propose a two-step solution that first queries large language models for visually confusing objects and then relies on vision-language pre-trained models (e.g., CLIP) to perform classification. By adapting large-scale vision benchmarks, we demonstrate the ability of the proposed method to outperform adapted off-the-shelf alternatives in this setting. Namely, we propose a realistic benchmark where negative query samples are drawn from the same original dataset as positive ones, including a granularity-controlled version of iNaturalist where negative samples are at a fixed distance from the positive ones in the taxonomy tree. Our work shows that it is possible to discriminate between a single category and other semantically related categories using only its label.
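The two-step recipe above can be pictured with a minimal sketch: the list of visually confusing negatives (hard-coded here) would in practice come from prompting an LLM, and CLIP text-image similarity then decides whether a query image belongs to the target class. The class names, image file, and argmax decision rule are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: classify a query image as the single known class vs.
# LLM-suggested confusing classes, using CLIP text-image similarity.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

target = "golden retriever"                        # the single known (positive) class
confusers = ["labrador retriever", "irish setter",
             "cocker spaniel"]                     # assumed LLM-suggested negatives

prompts = [f"a photo of a {c}" for c in [target] + confusers]
text_tokens = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)   # cosine similarity per prompt

# Accept the query as "positive" only if the target prompt beats every confuser.
is_positive = bool(sims.argmax().item() == 0)
print(sims.tolist(), is_positive)
```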
How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library
Categories: Computation and Language
Authors: Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00699v1
Abstract: With the rise of large language models (LLMs) in recent years, new opportunities keep emerging, but so do new challenges, and contamination is quickly becoming critical. Business applications and fundraising in AI have reached a scale at which a few percentage points of improvement on popular question-answering benchmarks could translate into tens of millions of dollars, placing tremendous pressure on model integrity. At the same time, it is becoming increasingly difficult to keep track of the data that LLMs have seen; for closed-source models such as GPT-4 and Claude-3, it is impossible if they disclose no information about their training sets. Contamination thus becomes a critical issue: LLMs' performance may no longer be reliable, as their high performance may be at least partly attributable to data they have previously been exposed to. This limitation jeopardizes overall progress in the field of natural language processing; however, there remains a lack of methods for effectively addressing contamination, and no clear consensus on its prevention, mitigation, and classification. In this paper, we survey all recent work on contamination in LLMs and help the community track contamination levels of LLMs by releasing an open-source Python library named LLMSanitize, which implements the major contamination detection algorithms and is available at: https://github.com/ntunlp/LLMSanitize.
Training-Free Semantic Segmentation via LLM-Supervision
Categories: Computer Vision and Pattern Recognition
Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G. M. Snoek
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00701v1
Abstract: Recent advances in open-vocabulary models such as CLIP have significantly improved zero-shot classification and segmentation by leveraging natural language for class-specific embeddings. However, most research has focused on improving model accuracy through prompt engineering, prompt learning, or fine-tuning with limited labeled data, thereby overlooking the importance of refining class descriptors. This paper introduces a new approach to text-supervised semantic segmentation that uses supervision from a large language model (LLM) and requires no additional training. Our method starts with an LLM such as GPT-3 to generate a detailed set of subclasses for more accurate class representation. We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels, producing distinct segmentation results tailored to the unique characteristics of each subclass. In addition, we propose an assembly that merges the segmentation maps from the various subclass descriptors to ensure a more comprehensive representation of the different aspects in the test images. Through comprehensive experiments on three standard benchmarks, our method significantly outperforms traditional text-supervised semantic segmentation methods.
The Larger the Better? Improving LLM Code Generation via Budget Reallocation
Categories: Software Engineering, Artificial Intelligence, Computation and Language, Machine Learning
Authors: Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, Yossi Adi
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00725v1
Abstract: It is widely believed that large language models (LLMs) are better than smaller ones. However, larger models also require more time and compute during inference. This raises the question: what happens when both models operate under the same budget (e.g., compute, runtime)? To address this question, we analyze code-generation LLMs of various sizes and make comparisons such as running a 70B model once versus generating five outputs from a 13B model and selecting one. Our findings show that, in a standard unit-test setting, repeatedly using a smaller model yields consistent improvements, with gains of up to 15% across five tasks. On the other hand, in settings where unit tests are unavailable, rank-based selection of candidates from the smaller model underperforms a single output from the larger model. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying methods for ranking LLM outputs.
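The sampling-and-selection idea above can be sketched in a few lines: draw several completions from a smaller code model and keep the first one that passes the unit tests. The model name, prompt, and tests below are illustrative placeholders, not the paper's setup.

```python
# Minimal sketch of "budget reallocation": k samples from a small code model
# plus unit-test filtering, instead of one sample from a larger model.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="Salesforce/codegen-350M-mono")  # assumed small model

PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
UNIT_TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

def passes_tests(code: str) -> bool:
    ns: dict = {}
    try:
        exec(code, ns)          # define the candidate function
        exec(UNIT_TESTS, ns)    # run the tests against it
        return True
    except Exception:
        return False

k = 5  # inference budget: k small-model samples vs. one large-model sample
candidates = generator(PROMPT, num_return_sequences=k, do_sample=True,
                       temperature=0.8, max_new_tokens=64)

chosen = next((c["generated_text"] for c in candidates
               if passes_tests(c["generated_text"])), None)
print(chosen or "no candidate passed the unit tests")
```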
Is Factuality Decoding a Free Lunch for LLMs? An Evaluation on Knowledge Editing Benchmarks
Categories: Computation and Language, Artificial Intelligence
Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Xueqi Cheng
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00216v1
Abstract: The rapid development of large language models (LLMs) enables them to convey factual knowledge in an increasingly human-like way. Extensive efforts have been made to reduce factual hallucinations by modifying how LLMs decode facts. However, these methods also introduce the risk of hindering knowledge updates, as they make the models overly confident in known facts. In this work, we first revisit current factuality decoding methods and verify their effectiveness in improving factual accuracy. Subsequently, we further evaluate several strong factuality decoding methods on a knowledge editing benchmark. Compared with the original decoding, all of these decoding methods significantly degrade the performance of the llama2 models, with the largest drop reaching a striking 81.3%. This further indicates that existing decoding methods still cannot perfectly resolve factual hallucinations, as they neglect the importance of preserving the flexibility needed for knowledge editing. Our work therefore suggests that research on factual alignment should also attend to the effectiveness of knowledge editing.
DeFT: Flash Tree-Attention with IO-Awareness for Efficient Tree-Search-Based LLM Inference
Categories: Computation and Language, Artificial Intelligence
Authors: Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00242v1
Abstract: Decoding with tree search can greatly enhance the inference quality of transformer-based large language models (LLMs). Guided by a signal, it searches for the best path from root to leaf in the tree formed by LLM outputs to improve controllability, reasoning ability, alignment, and more. However, current tree decoding strategies and their inference systems do not suit each other well, owing to redundancy in computation, memory footprint, and memory access, resulting in inefficient inference. To address this issue, we propose DeFT, an IO-aware tree-attention algorithm that maintains memory-efficient attention computation with a low memory footprint in two stages: (1) QKV preparation: we propose a KV-guided tree-split strategy to group QKV wisely, improving GPU utilization and reducing, as far as possible, the reads and writes of the KV cache between GPU global memory and on-chip shared memory; (2) attention computation: we compute partial attention for each QKV group in a fused kernel and then apply a tree-topology-aware global reduction strategy to obtain the final attention. Thanks to a 3.6-4.5$\times$ reduction in KV-cache IO, together with a further reduction in the IO of $\mathbf{Q}\mathbf{K}^\top$ and Softmax amounting to 25% of the total KV cache, DeFT achieves a 1.7-2.4$\times$ speedup in end-to-end latency over SOTA attention algorithms on two practical reasoning tasks.
Secret Keepers: The Impact of LLMs on Linguistic Markers of Personal Traits
Categories: Computation and Language
Authors: Zhivar Sourati, Meltem Ozcan, Colin McDaniel, Alireza Ziabari, Nuan Wen, Ala Tak, Fred Morstatter, Morteza Dehghani
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00267v1
Abstract: Prior research has established associations between individuals' language use and their personal traits; our linguistic patterns reveal information about our personality, emotional state, and beliefs. However, as large language models (LLMs) are increasingly adopted as writing assistants in everyday writing, a critical question arises: do authors' linguistic patterns remain predictive of their personal traits when LLMs are involved in the writing process? We investigate the impact of LLMs on the linguistic markers of demographic and psychological traits, specifically examining three LLMs - GPT-3.5, Llama 2, and Gemini - across six traits: gender, age, political affiliation, personality, empathy, and morality. Our findings indicate that although LLM use slightly reduces the predictive power of linguistic patterns for authors' personal traits, significant changes are uncommon, and LLM use does not fully diminish the predictive power of authors' linguistic patterns for their personal traits. We also note that some theoretically established lexical-based linguistic markers lose their reliability as predictors when LLMs are used in the writing process. Our findings have important implications for research on linguistic markers of personal traits in the era of LLMs.
A Comprehensive Study of NLP Data Augmentation for Hate Speech Detection: Legacy Methods, BERT, and LLMs
Categories: Computation and Language
Authors: Md Saroar Jahan, Mourad Oussalah, Djamila Romaissa Beddia, Jhuma kabir Mim, Nabil Arhab
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00303v1
Abstract: The surge of interest in data augmentation within NLP has been driven by the need to address the challenges posed by hate speech domains, the dynamic nature of social media vocabulary, and the demands of large-scale neural networks that require extensive training data. However, the prevalent use of lexical substitution in data augmentation has raised concerns, as it may inadvertently alter the intended meaning and thereby affect the efficacy of supervised machine learning models. In pursuit of suitable augmentation methods, this study explores both established legacy approaches and contemporary practices such as large language models (LLMs), including GPT, for hate speech detection. In addition, we propose an optimized utilization of BERT-based encoder models with contextual cosine-similarity filtering, exposing significant limitations of previous synonym-replacement methods. Our comparative analysis covers five popular augmentation techniques: WordNet and Fast-Text synonym replacement, back-translation, BERT-mask contextual augmentation, and LLMs. Our analysis on five benchmark datasets shows that while traditional methods such as back-translation exhibit a low label-change rate (0.3-1.5%), BERT-based contextual synonym replacement offers sentence diversity at the cost of a higher label-change rate (over 6%). Our proposed BERT-based contextual cosine-similarity filtering markedly reduces label changes to only 0.05%, demonstrating its efficacy with a 0.7% improvement in F1 performance. However, augmenting data with GPT-3, with up to sevenfold more data, not only avoids overfitting but also improves embedding-space coverage by 15% and classification F1 score by 1.4% compared with traditional methods, and by 0.8% over our method.
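The contextual cosine-similarity filter mentioned above can be approximated with any sentence encoder: an augmented candidate is kept only if it remains semantically close to the source sentence, which suppresses augmentations that would flip the label. A minimal sketch with an assumed encoder and threshold, not the paper's exact configuration:

```python
# Minimal sketch of cosine-similarity filtering for augmented text.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
THRESHOLD = 0.85                                   # tune on a validation split

def filter_augmentations(original: str, candidates: list[str]) -> list[str]:
    embeddings = encoder.encode([original] + candidates, convert_to_tensor=True)
    sims = util.cos_sim(embeddings[0:1], embeddings[1:]).squeeze(0)
    return [c for c, s in zip(candidates, sims) if s.item() >= THRESHOLD]

kept = filter_augmentations(
    "I can't stand this kind of hateful comment.",
    ["I cannot stand this sort of hateful comment.",    # close paraphrase -> kept
     "I really enjoy this kind of friendly comment."])  # meaning flipped -> dropped
print(kept)
```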
Augmenting NER Datasets with LLMs: Towards Automated and Refined Annotation
Categories: Computation and Language, Machine Learning
Authors: Yuji Naraki, Ryosuke Yamaki, Yoshikazu Ikeda, Takafumi Horie, Hiroki Naganuma
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.01334v1
Abstract: In the field of natural language processing (NLP), named entity recognition (NER) is recognized as a critical technology, widely employed across a variety of applications. Traditional methods of annotating datasets for NER models are challenged by high costs and variation in dataset quality. This research introduces a novel hybrid annotation approach that combines human effort with the capabilities of large language models (LLMs). This approach not only aims to reduce the noise inherent in manual annotations, such as omissions, thereby improving NER model performance, but also achieves this in a cost-effective manner. In addition, by employing a label-mixing strategy, it addresses the class imbalance encountered in LLM-based annotations. Through analyses across multiple datasets, this method consistently delivers superior performance compared with traditional annotation methods, even under constrained budgets. This study highlights the potential of leveraging LLMs to improve dataset quality, introduces a novel technique to mitigate class imbalance, and demonstrates the feasibility of achieving high-performance NER in a cost-effective way.
Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
Categories: Computation and Language, Artificial Intelligence, Information Retrieval
Authors: Ankit Satpute, Noah Giessing, Andre Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00344v1
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advances, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answering benchmarks, to generate answers to 78 questions from Math Stack Exchange (MSE). Second, we conduct a case analysis of the best-performing LLM, focusing on the quality and accuracy of its answers through manual evaluation. We find that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and, considering P@10, outperforms the current best approach on ArqMATH3 Task1. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in solving complex mathematical problems. Through the case analysis, we shed light on the gaps in LLMs' capabilities in mathematics, thereby setting the stage for future research and advances in AI-driven mathematical reasoning. We make our code and findings publicly available for research: \url{https://github.com/gipplab/LLM-Investig-MathStackExchange}
A Taxonomy for Human-LLM Interaction Modes: An Initial Exploration
Categories: Human-Computer Interaction
Authors: Jie Gao, Simret Araya Gebreegziabher, Kenny Tsu Wei Choo, Toby Jia-Jun Li, Simon Tangi Perrault, Thomas W. Malone
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00405v1
Abstract: With the release of ChatGPT, conversational prompting has become the most popular form of human-LLM interaction. However, its effectiveness is limited for more complex tasks involving reasoning, creativity, and iteration. Through a systematic analysis of HCI papers published since 2021, we identify four key phases in the human-LLM interaction flow - planning, facilitating, iterating, and testing - to precisely understand the dynamics of this process. In addition, we develop a taxonomy of four primary interaction modes: Mode 1: Standard Prompting, Mode 2: User Interface, Mode 3: Context-based, and Mode 4: Agent Facilitator. This taxonomy is further enriched with the "5W1H" guideline method, which involves a detailed examination of the definitions, participant roles (Who), the phases in which each interaction occurs (When), human objectives and LLM abilities (What), and the mechanics of each interaction mode (How). We anticipate that this taxonomy will contribute to the future design and evaluation of human-LLM interaction.
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Categories: Machine Learning
Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00456v1
Abstract: We introduce QuaRot, a new rotation-based quantization scheme that enables end-to-end quantization of LLMs, including all weights, activations, and the KV cache, in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden states without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism, and the KV cache. The result is a quantized model in which all matrix multiplications are performed in 4 bits, without any channels identified for retention in higher precision. Our quantized LLaMa2-70B model loses at most 0.29 WikiText-2 perplexity and retains 99% of its zero-shot performance. Code is available at: https://github.com/spcl/QuaRot.
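The computational-invariance idea behind such rotation schemes can be illustrated in a few lines of NumPy: multiplying activations and weights by the same orthogonal matrix leaves the layer output unchanged while spreading outlier mass across channels, which makes low-bit quantization easier. This is a toy illustration of the principle, not QuaRot itself; shapes and values are arbitrary.

```python
# Minimal sketch of computational invariance under a shared orthogonal rotation.
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(4, d))
x[:, 3] *= 50.0                                # inject an outlier channel
W = rng.normal(size=(d, d))

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix

y_ref = x @ W.T                                # original layer output
y_rot = (x @ Q) @ (W @ Q).T                    # rotated activations and weights

print(np.allclose(y_ref, y_rot))               # True: the output is unchanged
print(np.abs(x).max(), np.abs(x @ Q).max())    # the worst outlier typically shrinks
```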
MetaIE: Distilling a Meta Model from LLM for Various Information Extraction Tasks
Categories: Computation and Language
Authors: Letian Peng, Zilong Wang, Feng Yao, Zihan Wang, Jingbo Shang
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00457v1
Abstract: Information extraction (IE) is a fundamental area of natural language processing in which prompting large language models (LLMs), even with in-context examples, cannot beat small LMs tuned on very small IE datasets. We observe that IE tasks, such as named entity recognition and relation extraction, all focus on extracting important information, which can be formalized as label-to-span matching. In this paper, we propose a novel framework, MetaIE, to build a small LM as a meta-model by learning to extract "important information" (i.e., the meta-understanding of IE), so that this meta-model can be adapted to all kinds of IE tasks effectively and efficiently. Specifically, MetaIE obtains the small LM via symbolic distillation from an LLM following the label-to-span scheme. We construct the distillation dataset by sampling sentences from language model pre-training datasets (e.g., OpenWebText in our implementation) and prompting an LLM to identify the typed spans of "important information". We evaluate the meta-model under the few-shot adaptation setting. Extensive results on 13 datasets from 6 IE tasks confirm that MetaIE offers a better starting point for few-shot tuning on IE datasets and outperforms other meta-models distilled from (1) vanilla language model pre-training, (2) multi-IE-task pre-training with human annotations, and (3) single-IE-task symbolic distillation from an LLM. We also provide a comprehensive analysis of MetaIE, covering, for example, the size of the distillation dataset, the meta-model architecture, and the size of the meta-model.
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Categories: Computation and Language
Authors: Eli Schwartz, Leshem Choshen, Joseph Shtok, Sivan Doveh, Leonid Karlinsky, Assaf Arbelle
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00459v1
Abstract: Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partly attributed to non-intuitive textual number representations. When a causal language model reads or generates a number, it does not know a digit's place value (e.g., thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented: including the count of digits before each number. For instance, instead of "42", we suggest using "{2:42}" as the new format. This approach, which we call NumeroLogic, offers an additional advantage in number generation by acting as a chain of thought (CoT): by requiring the model to consider the number of digits first, it enhances the reasoning process before the actual number is generated. We use arithmetic tasks to demonstrate the effectiveness of NumeroLogic formatting. We further demonstrate NumeroLogic's applicability to general natural language modeling, improving language-understanding performance on the MMLU benchmark.
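The formatting rule itself is easy to apply as a preprocessing step. A minimal sketch follows (integers only; signs and decimals are ignored here):

```python
# Rewrite every integer as "{digit_count:number}", as described above,
# and provide the inverse mapping for reading model outputs back.
import re

def to_numerologic(text: str) -> str:
    return re.sub(r"\d+", lambda m: f"{{{len(m.group())}:{m.group()}}}", text)

def from_numerologic(text: str) -> str:
    return re.sub(r"\{\d+:(\d+)\}", r"\1", text)

s = "12 plus 345 equals 357"
encoded = to_numerologic(s)              # '{2:12} plus {3:345} equals {3:357}'
print(encoded)
print(from_numerologic(encoded) == s)    # True: the encoding round-trips
```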
Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
Categories: Computation and Language, Artificial Intelligence
Authors: Shu Yang, Jiayuan Su, Han Jiang, Mengdi Li, Keyuan Cheng, Muhammad Asif Ali, Lijie Hu, Di Wang
Published: 2024-03-30
Link: http://arxiv.org/abs/2404.00486v1
Abstract: With the rise of large language models (LLMs), ensuring that they embody the principles of being helpful, honest, and harmless (3H), known as human alignment, has become crucial. While existing alignment methods such as RLHF and DPO can effectively fine-tune LLMs to match the preferences in a preference dataset, they often lead LLMs to be highly receptive to human input and external evidence, even when that information is poisoned. This causes LLMs to tend to become adaptive chameleons when external evidence conflicts with their parametric memory, which exacerbates the risk of LLMs being attacked by external poisoned data and poses significant security risks to LLM system applications such as retrieval-augmented generation (RAG). To address this challenge, we propose a novel framework, Dialectical Alignment (DA), which (1) utilizes AI feedback to identify the optimal strategies for LLMs to navigate inter-context conflicts and context-memory conflicts under different external evidence in the context window (i.e., different proportions of poisoned factual contexts); (2) constructs an SFT dataset as well as a preference dataset based on the above AI feedback and strategies; and (3) uses the above datasets for LLM alignment to defend against poisoned-context attacks while preserving the effectiveness of in-context knowledge editing. Our experiments show that the Dialectical Alignment model improves poisoned-data attack defense by 20 and requires no additional prompt engineering or prior declaration such as "you may be attacked" in the LLM's context window.
Contextual AI Journaling: Integrating LLM and Time-Series Behavioral Sensing Technology to Promote Self-Reflection and Well-being using the MindScape App
Categories: Human-Computer Interaction, Artificial Intelligence, H.5.0; H.5.3; H.5.m; J.0
Authors: Subigya Nepal, Arvind Pillai, William Campbell, Talie Massachi, Eunsol Soul Choi, Orson Xu, Joanna Kuc, Jeremy Huckins, Jason Holden, Colin Depp, Nicholas Jacobson, Mary Czerwinski, Eric Granholm, Andrew T. Campbell
Published: 2024-03-31
Link: http://arxiv.org/abs/2404.00487v1
Abstract: MindScape aims to study the benefits of combining time-series behavioral patterns (e.g., conversational engagement, sleep, location) with large language models (LLMs) to create a new form of contextual AI journaling that promotes self-reflection and well-being. We argue that integrating behavioral sensing into LLMs may open a new frontier in AI. In this late-breaking work, we discuss the design of the MindScape contextual journaling app, which uses LLMs and behavioral sensing to generate contextual and personalized journaling prompts aimed at encouraging self-reflection and emotional development. We also discuss our preliminary user study of MindScape with college students, as well as our upcoming study to assess the effectiveness of contextual AI journaling in promoting better well-being on college campuses. MindScape represents a new application class that embeds behavioral intelligence in AI.
Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching
Categories: Computation and Language
Authors: Shulin Liu, Chengcheng Xu, Hao Liu, Tinghao Yu, Tao Yang
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.19930v1
摘要: The recent success of Large Language Models (LLMs) has garnered significant attention in both academia and industry. Prior research on LLMs has primarily focused on enhancing or leveraging their generalization capabilities in zero- and few-shot settings. However, there has been limited investigation into effectively fine-tuning LLMs for a specific natural language understanding task in supervised settings. In this study, we conduct an experimental analysis by fine-tuning LLMs for the task of Chinese short text matching. We explore various factors that influence performance when fine-tuning LLMs, including task modeling methods, prompt formats, and output formats.
Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning
Categories: Computation and Language, Artificial Intelligence, Machine Learning
Authors: Qinhao Zhou, Zihan Zhang, Xiang Xiang, Ke Wang, Yuchuan Wu, Yongbin Li
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.19962v1
摘要: Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities, making them highly successful in a variety of tasks. However, when used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4. As intelligent agents, LLMs need to have the capabilities of task planning, long-term memory, and the ability to leverage external tools to achieve satisfactory performance. Various methods have been proposed to enhance the agent capabilities of LLMs. On the one hand, methods involve constructing agent-specific data and fine-tuning the models. On the other hand, some methods focus on designing prompts that effectively activate the reasoning abilities of the LLMs. We explore both strategies on the 7B and 13B models. We propose a comprehensive method for constructing agent-specific data using GPT-4. Through supervised fine-tuning with constructed data, we find that for these models with a relatively small number of parameters, supervised fine-tuning can significantly reduce hallucination outputs and formatting errors in agent tasks. Furthermore, techniques such as multi-path reasoning and task decomposition can effectively decrease problem complexity and enhance the performance of LLMs as agents. We evaluate our method on five agent tasks of AgentBench and achieve satisfactory results.
Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning
Categories: Computation and Language
Authors: Yongqi Tong, Dawei Li, Sizhe Wang, Yujia Wang, Fei Teng, Jingbo Shang
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.20046v1
摘要: Recent works have shown the benefits to LLMs from fine-tuning golden-standard Chain-of-Thought (CoT) rationales or using them as correct examples in few-shot prompting. While humans can indeed imitate correct examples, learning from our mistakes is another vital aspect of human cognition. Hence, a question naturally arises: \textit{can LLMs learn and benefit from their mistakes, especially for their reasoning? } This study investigates this problem from both the prompting and model-tuning perspectives. We begin by introducing \textsc{CoTErrorSet}, a new benchmark with 609,432 questions, each designed with both correct and error references, and demonstrating the types and reasons for making such mistakes. To explore the effectiveness of those mistakes, we design two methods: (1) \textbf{Self-rethinking} prompting guides LLMs to rethink whether they have made similar previous mistakes; and (2) \textbf{Mistake tuning} involves finetuning models in both correct and incorrect reasoning domains, rather than only tuning models to learn ground truth in traditional methodology. We conduct a series of experiments to prove LLMs can obtain benefits from mistakes in both directions. Our two methods offer potentially cost-effective strategies by leveraging errors to enhance reasoning capabilities, which costs significantly less than creating meticulously hand-crafted golden references. We ultimately make a thorough analysis of the reasons behind LLMs' errors, which provides directions that future research needs to overcome. \textsc{CoTErrorSet} will be published soon on \texttt{Anonymity Link}.
Accurate Block Quantization in LLMs with Outliers
Categories: Artificial Intelligence, Hardware Architecture, Numerical Analysis
Authors: Nikita Trukhanov, Ilya Soloveychik
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.20137v1
Abstract: The demand for inference on extremely large-scale LLMs has seen enormous growth in recent months. This has made evident the colossal shortage of dedicated hardware capable of efficient and fast processing of the involved compute and memory movement. The problem is aggravated by the explosive rise in the lengths of the sequences being processed, since those require efficient on-chip storage of the KV-cache of size proportional to the sequence length. To make the required compute feasible and fit the involved data into available memory, numerous quantization techniques have been proposed that allow accurate quantization for both weights and activations. One of the main recent breakthroughs in this direction was the introduction of the family of Block Floating Point (BFP) formats characterized by a block of mantissas with a shared scale factor. These enable memory-, power-, and compute-efficient hardware support of tensor operations and provide extremely good quantization accuracy. The main issue preventing widespread application of block formats is the presence of outliers in weights and activations, since those affect the accuracy of the other values in the same block. In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach enabling usage of low-precision BFP formats without compromising the resulting model accuracy. We exploit the common channel-wise patterns exhibited by the outliers to rearrange them in such a way that their quantization quality is significantly improved. The methodology yields 2x savings in the memory footprint without significant degradation of the model's accuracy. Importantly, the rearrangement of channels happens at compile time and thus has no impact on the inference latency.
Using LLMs to Model the Beliefs and Preferences of Targeted Populations
Categories: Computation and Language, Artificial Intelligence, Machine Learning
Authors: Keiichi Namikoshi, Alex Filipowicz, David A. Shamma, Rumen Iliev, Candice L. Hogan, Nikos Arechiga
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.20252v1
摘要: We consider the problem of aligning a large language model (LLM) to model the preferences of a human population. Modeling the beliefs, preferences, and behaviors of a specific population can be useful for a variety of different applications, such as conducting simulated focus groups for new products, conducting virtual surveys, and testing behavioral interventions, especially for interventions that are expensive, impractical, or unethical. Existing work has had mixed success using LLMs to accurately model human behavior in different contexts. We benchmark and evaluate two well-known fine-tuning approaches and evaluate the resulting populations on their ability to match the preferences of real human respondents on a survey of preferences for battery electric vehicles (BEVs). We evaluate our models against their ability to match population-wide statistics as well as their ability to match individual responses, and we investigate the role of temperature in controlling the trade-offs between these two. Additionally, we propose and evaluate a novel loss term to improve model performance on responses that require a numeric response.
LUQ: Long-text Uncertainty Quantification for LLMs
Categories: Computation and Language
Authors: Caiqi Zhang, Fangyu Liu, Marco Basaldella, Nigel Collier
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.20279v1
摘要: Large Language Models (LLMs) have demonstrated remarkable capability in a variety of NLP tasks. Despite their effectiveness, these models are prone to generate nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generated content, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, typically yielding brief, word-limited responses. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce \textsc{Luq}, a novel sampling-based UQ approach specifically designed for long text. Our findings reveal that \textsc{Luq} outperforms existing baseline methods in correlating with the model's factuality scores (negative coefficient of -0.85 observed for Gemini Pro). With \textsc{Luq} as the tool for UQ, we investigate behavior patterns of several popular LLMs' response confidence spectrum and how that interplays with the response' factuality. We identify that LLMs lack confidence in generating long text for rare facts and a factually strong model (i.e. GPT-4) tends to reject questions it is not sure about. To further improve the factual accuracy of LLM responses, we propose a method called \textsc{Luq-Ensemble} that ensembles responses from multiple models and selects the response with the least uncertainty. The ensembling method greatly improves the response factuality upon the best standalone LLM.
Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
Categories: Computation and Language, Artificial Intelligence
Authors: Burcu Sayin, Pasquale Minervini, Jacopo Staiano, Andrea Passerini
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.20288v1
摘要: We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.
Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference
Categories: Artificial Intelligence, Hardware Architecture, Distributed, Parallel, and Cluster Computing, C.0; I.2
Authors: Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas
Published: 2024-03-29
Link: http://arxiv.org/abs/2403.20306v1
摘要: With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models. In this paper, we present the trade-offs brought up by making energy efficiency the primary goal of LLM serving under performance SLOs. We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient. We characterize the impact of these knobs on the latency, throughput, as well as the energy. By exploring these trade-offs, we offer valuable insights into optimizing energy usage without compromising on performance, thereby paving the way for sustainable and cost-effective LLM deployment in data center environments.
Learning From Correctness Without Prompting Makes LLM Efficient Reasoner
Categories: Computation and Language
Authors: Yuxuan Yao, Han Wu, Zhijiang Guo, Biyan Zhou, Jiahui Gao, Sichun Luo, Hanxu Hou, Xiaojin Fu, Linqi Song
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19094v1
摘要: Large language models (LLMs) have demonstrated outstanding performance across various tasks, yet they still exhibit limitations such as hallucination, unfaithful reasoning, and toxic content. One potential approach to mitigate these issues is learning from human or external feedback (e.g. tools). In this paper, we introduce an intrinsic self-correct reasoning framework for LLMs that eliminates the need for human feedback, external tools, and handcraft prompts. The proposed framework, based on a multi-step reasoning paradigm \textbf{Le}arning from \textbf{Co}rrectness (\textsc{LeCo}), improves reasoning performance without needing to learn from errors. This paradigm prioritizes learning from correct reasoning steps, and a unique method to measure confidence for each reasoning step based on generation logits. Experimental results across various multi-step reasoning tasks demonstrate the effectiveness of the framework in improving reasoning performance with reduced token consumption.
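One way to picture logit-based step confidence is to score each reasoning step by its average token log-probability and treat the weakest step as the point to rethink from. This is a simplified stand-in for the paper's confidence measure, shown only to illustrate the idea; the log-probabilities below are hypothetical.

```python
# Minimal sketch: score reasoning-step confidence from per-token log-probs.
import math

def step_confidence(token_logprobs: list[float]) -> float:
    """Average per-token log-probability of one reasoning step."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)

# Hypothetical log-probs returned by the decoder for three reasoning steps.
steps = [
    [-0.05, -0.10, -0.02, -0.07],   # fluent, high-confidence step
    [-0.20, -0.15, -0.30],          # moderately confident step
    [-1.90, -2.40, -0.80, -1.10],   # low-confidence step: rethink from here
]

scores = [step_confidence(s) for s in steps]
weakest = min(range(len(scores)), key=scores.__getitem__)
print([round(math.exp(s), 3) for s in scores], "weakest step:", weakest)
```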
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
Categories: Software Engineering, Computation and Language, Machine Learning, Programming Languages
Authors: Chunqiu Steven Xia, Yinlin Deng, Lingming Zhang
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19114v1
摘要: LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage where example solutions can be readily found on the web and thus potentially in training data. Such limitations inevitably lead us to inquire: Is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis ability of LLMs? To address this, we introduce EvoEval -- a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities. Our study on 51 LLMs shows that compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance (on average 39.4%) when using EvoEval. Additionally, the decrease in performance can range from 19.6% to 47.7%, leading to drastic ranking changes amongst LLMs and showing potential overfitting of existing benchmarks. Furthermore, we showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes as well as the importance of learning problem composition and decomposition. EvoEval not only provides comprehensive benchmarks, but can be used to further evolve arbitrary problems to keep up with advances and the ever-changing landscape of LLMs for code. We have open-sourced our benchmarks, tools, and complete LLM generations at https://github.com/evo-eval/evoeval
Generate then Retrieve: Conversational Response Retrieval Using LLMs as Answer and Query Generators
Categories: Information Retrieval
Authors: Zahra Abbasiantaeb, Mohammad Aliannejadi
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19302v1
Abstract: Conversational Information Seeking (CIS) is a prominent area in Information Retrieval (IR) that focuses on developing interactive knowledge assistants. These systems must adeptly comprehend the user's information requirements within the conversational context and retrieve the relevant information. To this aim, existing approaches model the user's information needs with a single query, called the rewritten query, and use this query for passage retrieval. In this paper, we propose three different methods for generating multiple queries to enhance the retrieval. In these methods, we leverage the capabilities of large language models (LLMs) in understanding the user's information need and generating an appropriate response, to generate multiple queries. We implement and evaluate the proposed models utilizing various LLMs including GPT-4 and Llama-2 chat in zero-shot and few-shot settings. In addition, we propose a new benchmark for TREC iKAT based on GPT-3.5 judgments. Our experiments reveal the effectiveness of our proposed models on the TREC iKAT dataset.
TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios
Categories: Computation and Language
Authors: Xiaokang Zhang, Jing Zhang, Zeyao Ma, Yang Li, Bohan Zhang, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, Jifan Yu, Shu Zhao, Juanzi Li, Jie Tang
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19318v1
摘要: We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for proficiently handling tabular data manipulation tasks, whether they are embedded within documents or spreadsheets, catering to real-world office scenarios. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, aiding in training LLMs to understand reasoning patterns more effectively as well as a cross-way validation strategy, ensuring the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted a benchmark tailored to address both document and spreadsheet formats as well as constructed a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM when compared to various existing general-purpose and tabular data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction.
Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors
Categories: Information Retrieval, Artificial Intelligence
Authors: Binzong Geng, Zhaoxin Huan, Xiaolu Zhang, Yong He, Liang Zhang, Fajie Yuan, Jun Zhou, Linjian Mo
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19347v1
摘要: With the rise of large language models (LLMs), recent works have leveraged LLMs to improve the performance of click-through rate (CTR) prediction. However, we argue that a critical obstacle remains in deploying LLMs for practical use: the efficiency of LLMs when processing long textual user behaviors. As user sequences grow longer, the current efficiency of LLMs is inadequate for training on billions of users and items. To break through the efficiency barrier of LLMs, we propose Behavior Aggregated Hierarchical Encoding (BAHE) to enhance the efficiency of LLM-based CTR modeling. Specifically, BAHE proposes a novel hierarchical architecture that decouples the encoding of user behaviors from inter-behavior interactions. Firstly, to prevent computational redundancy from repeated encoding of identical user behaviors, BAHE employs the LLM's pre-trained shallow layers to extract embeddings of the most granular, atomic user behaviors from extensive user sequences and stores them in the offline database. Subsequently, the deeper, trainable layers of the LLM facilitate intricate inter-behavior interactions, thereby generating comprehensive user embeddings. This separation allows the learning of high-level user representations to be independent of low-level behavior encoding, significantly reducing computational complexity. Finally, these refined user embeddings, in conjunction with correspondingly processed item embeddings, are incorporated into the CTR model to compute the CTR scores. Extensive experimental results show that BAHE reduces training time and memory by five times for CTR models using LLMs, especially with longer user sequences. BAHE has been deployed in a real-world system, allowing for daily updates of 50 million CTR data on 8 A100 GPUs, making LLMs practical for industrial CTR prediction.
Checkpoint Merging via Bayesian Optimization in LLM Pretraining
Categories: Computation and Language
Authors: Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19390v1
摘要: The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. To alleviate this issue, we propose checkpoint merging in pretraining LLM. This method utilizes LLM checkpoints with shared training trajectories, and is rooted in an extensive search space exploration for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) Our proposed methodology exhibits the capacity to augment pretraining, presenting an opportunity akin to obtaining substantial benefits at minimal cost; (2) Our proposed methodology, despite requiring a given held-out dataset, still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining.
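At its core, checkpoint merging interpolates checkpoints from the same training run and searches for the merging weight that minimizes a held-out loss. The paper performs this search with Bayesian optimization; in the minimal sketch below a toy model, synthetic data, and a plain grid search stand in for that setup.

```python
# Minimal sketch of checkpoint merging with a searched interpolation weight.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, lam: float) -> dict:
    return {k: lam * sd_a[k] + (1.0 - lam) * sd_b[k] for k in sd_a}

def heldout_loss(model: torch.nn.Module, batch) -> float:
    x, y = batch
    with torch.no_grad():
        return torch.nn.functional.mse_loss(model(x), y).item()

# Toy stand-ins for two pretraining checkpoints of the same architecture.
model = torch.nn.Linear(8, 1)
ckpt_a = {k: v + 0.05 * torch.randn_like(v) for k, v in model.state_dict().items()}
ckpt_b = {k: v - 0.05 * torch.randn_like(v) for k, v in model.state_dict().items()}
batch = (torch.randn(32, 8), torch.randn(32, 1))

best_lam, best_loss = None, float("inf")
for lam in [i / 10 for i in range(11)]:        # candidate merging weights
    model.load_state_dict(merge_state_dicts(ckpt_a, ckpt_b, lam))
    loss = heldout_loss(model, batch)
    if loss < best_loss:
        best_lam, best_loss = lam, loss
print(best_lam, round(best_loss, 4))
```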
Enhancing Anomaly Detection in Financial Markets with an LLM-based Multi-Agent Framework
Categories: Risk Management
Authors: Taejin Park
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19735v1
摘要: This paper introduces a Large Language Model (LLM)-based multi-agent framework designed to enhance anomaly detection within financial market data, tackling the longstanding challenge of manually verifying system-generated anomaly alerts. The framework harnesses a collaborative network of AI agents, each specialised in distinct functions including data conversion, expert analysis via web research, institutional knowledge utilization or cross-checking and report consolidation and management roles. By coordinating these agents towards a common objective, the framework provides a comprehensive and automated approach for validating and interpreting financial data anomalies. I analyse the S&P 500 index to demonstrate the framework's proficiency in enhancing the efficiency, accuracy and reduction of human intervention in financial market monitoring. The integration of AI's autonomous functionalities with established analytical methods not only underscores the framework's effectiveness in anomaly detection but also signals its broader applicability in supporting financial market monitoring.
LLMs as Academic Reading Companions: Extending HCI Through Synthetic Personae
Categories: Human-Computer Interaction
Authors: Celia Chen, Alex Leitch
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19506v1
摘要: This position paper argues that large language models (LLMs) constitute promising yet underutilized academic reading companions capable of enhancing learning. We detail an exploratory study examining Claude.ai from Anthropic, an LLM-based interactive assistant that helps students comprehend complex qualitative literature content. The study compares quantitative survey data and qualitative interviews assessing outcomes between a control group and an experimental group leveraging Claude.ai over a semester across two graduate courses. Initial findings demonstrate tangible improvements in reading comprehension and engagement among participants using the AI agent versus unsupported independent study. However, there is potential for overreliance and ethical considerations that warrant continued investigation. By documenting an early integration of an LLM reading companion into an educational context, this work contributes pragmatic insights to guide development of synthetic personae supporting learning. Broader impacts compel policy and industry actions to uphold responsible design in order to maximize benefits of AI integration while prioritizing student wellbeing.
LLMSense: Harnessing LLMs for High-level Reasoning Over Spatiotemporal Sensor Traces
Categories: Artificial Intelligence
Authors: Xiaomin Ouyang, Mani Srivastava
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19857v1
摘要: Most studies on machine learning in sensing systems focus on low-level perception tasks that process raw sensory data within a short time window. However, many practical applications, such as human routine modeling and occupancy tracking, require high-level reasoning abilities to comprehend concepts and make inferences based on long-term sensor traces. Existing machine learning-based approaches for handling such complex tasks struggle to generalize due to the limited training samples and the high dimensionality of sensor traces, necessitating the integration of human knowledge for designing first-principle models or logic reasoning methods. We pose a fundamental question: Can we harness the reasoning capabilities and world knowledge of Large Language Models (LLMs) to recognize complex events from long-term spatiotemporal sensor traces? To answer this question, we design an effective prompting framework for LLMs on high-level reasoning tasks, which can handle traces from the raw sensor data as well as the low-level perception results. We also design two strategies to enhance performance with long sensor traces, including summarization before reasoning and selective inclusion of historical traces. Our framework can be implemented in an edge-cloud setup, running small LLMs on the edge for data summarization and performing high-level reasoning on the cloud for privacy preservation. The results show that LLMSense can achieve over 80% accuracy on two high-level reasoning tasks such as dementia diagnosis with behavior traces and occupancy tracking with environmental sensor traces. This paper provides a few insights and guidelines for leveraging LLM for high-level reasoning on sensor traces and highlights several directions for future work.
"I'm categorizing LLM as a productivity tool": Examining ethics of LLM use in HCI research practices
Categories: Human-Computer Interaction
Authors: Shivani Kapania, Ruiyi Wang, Toby Jia-Jun Li, Tianshi Li, Hong Shen
Published: 2024-03-28
Link: http://arxiv.org/abs/2403.19876v1
摘要: Large language models are increasingly applied in real-world scenarios, including research and education. These models, however, come with well-known ethical issues, which may manifest in unexpected ways in human-computer interaction research due to the extensive engagement with human subjects. This paper reports on research practices related to LLM use, drawing on 16 semi-structured interviews and a survey conducted with 50 HCI researchers. We discuss the ways in which LLMs are already being utilized throughout the entire HCI research pipeline, from ideation to system development and paper writing. While researchers described nuanced understandings of ethical issues, they were rarely or only partially able to identify and address those ethical concerns in their own projects. This lack of action and reliance on workarounds was explained through the perceived lack of control and distributed responsibility in the LLM supply chain, the conditional nature of engaging with ethics, and competing priorities. Finally, we reflect on the implications of our findings and present opportunities to shape emerging norms of engaging with large language models in HCI research.
LLMs in HCI Data Work: Bridging the Gap Between Information Retrieval and Responsible Research Practices
Categories: Human-Computer Interaction, Information Retrieval
Authors: Neda Taghizadeh Serajeh, Iman Mohammadi, Vittorio Fuccella, Mattia De Rosa
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18173v1
Abstract: Efficient and accurate information extraction from scientific papers is significant in the rapidly developing human-computer interaction research in the literature review process. Our paper introduces and analyses a new information retrieval system using state-of-the-art Large Language Models (LLMs) in combination with structured text analysis techniques to extract experimental data from HCI literature, emphasizing key elements. We then analyze the challenges and risks of using LLMs in the world of research. We performed a comprehensive analysis on our conducted dataset, which contained the specified information of 300 CHI 2020-2022 papers, to evaluate the performance of the two large language models, GPT-3.5 (text-davinci-003) and Llama-2-70b, paired with structured text analysis techniques. The GPT-3.5 model gains an accuracy of 58% and a mean absolute error of 7.00. In contrast, the Llama2 model indicates an accuracy of 56% with a mean absolute error of 7.63. The ability to answer questions was also included in the system in order to work with streamlined data. By evaluating the risks and opportunities presented by LLMs, our work contributes to the ongoing dialogue on establishing methodological validity and ethical guidelines for LLM use in HCI data work.
Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real-World Detection Challenges
Categories: Computation and Language, Social and Information Networks
Authors: Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, Chang-Tien Lu
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18249v1
摘要: Recent advancements in Large Language Models (LLMs) have enabled the creation of fake news, particularly in complex fields like healthcare. Studies highlight the gap in the deceptive power of LLM-generated fake news with and without human assistance, yet the potential of prompting techniques has not been fully explored. Thus, this work aims to determine whether prompting strategies can effectively narrow this gap. Current LLM-based fake news attacks require human intervention for information gathering and often miss details and fail to maintain context consistency. Therefore, to better understand threat tactics, we propose a strong fake news attack method called conditional Variational-autoencoder-Like Prompt (VLPrompt). Unlike current methods, VLPrompt eliminates the need for additional data collection while maintaining contextual coherence and preserving the intricacies of the original text. To propel future research on detecting VLPrompt attacks, we created a new dataset named VLPrompt fake news (VLPFN) containing real and fake texts. Our experiments, including various detection methods and novel human study metrics, were conducted to assess their performance on our dataset, yielding numerous findings.
Can LLMs Converse Formally? Automatically Assessing LLMs in Translating and Interpreting Formal Specifications
Categories: Computation and Language, Artificial Intelligence
Authors: Rushang Karia, Daksh Dobhal, Daniel Bramblett, Pulkit Verma, Siddharth Srivastava
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18327v1
摘要: Stakeholders often describe system requirements using natural language which are then converted to formal syntax by a domain-expert leading to increased design costs. This paper assesses the capabilities of Large Language Models (LLMs) in converting between natural language descriptions and formal specifications. Existing work has evaluated the capabilities of LLMs in generating formal syntax such as source code but such experiments are typically hand-crafted and use problems that are likely to be in the training set of LLMs, and often require human-annotated datasets. We propose an approach that can use two copies of an LLM in conjunction with an off-the-shelf verifier to automatically evaluate its translation abilities without any additional human input. Our approach generates formal syntax using language grammars to automatically generate a dataset. We conduct an empirical evaluation to measure the accuracy of this translation task and show that SOTA LLMs cannot adequately solve this task, limiting their current utility in the design of complex systems.
Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
Categories: Computation and Language
Authors: Hongshen Xu, Zichen Zhu, Da Ma, Situo Zhang, Shuai Fan, Lu Chen, Kai Yu
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18349v1
摘要: Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, due to their limitations in discerning questions beyond their knowledge scope. While addressing hallucination has been a focal point in research, previous efforts primarily concentrate on enhancing correctness without giving due consideration to the significance of rejection mechanisms. In this paper, we conduct a comprehensive examination of the role of rejection, introducing the notion of model reliability along with corresponding metrics. These metrics measure the model's ability to provide accurate responses while adeptly rejecting questions exceeding its knowledge boundaries, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model to encourage the refusal of out-of-knowledge questions. Experimental results on mathematical questions affirm the substantial efficacy of RLKF in significantly enhancing LLM reliability.
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs
Categories: Cryptography and Security
Authors: Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Yanming Zhang, Weiming Zhang, Nenghai Yu
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18403v1
Abstract: Analyzing the behavior of cryptographic functions in stripped binaries is a challenging but essential task. Cryptographic algorithms exhibit greater logical complexity compared to typical code, yet their analysis is unavoidable in areas such as virus analysis and legacy code inspection. Existing methods often rely on data or structural pattern matching, leading to suboptimal generalizability and suffering from manual work. In this paper, we propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. In FoC, we first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. The prediction of FoC-BinLLM is insensitive to minor changes, such as vulnerability patches. To mitigate it, we further build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database. In addition, we construct a cryptographic binary dataset for evaluation and to facilitate further research in this domain. An automated method is also devised to create semantic labels for extensive binary functions. Evaluation results demonstrate that FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score. FoC-Sim outperforms the previous best methods with a 52% higher Recall@1. Furthermore, our method also shows practical ability in virus analysis and 1-day vulnerability detection.
SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens
Categories: Computation and Language
Authors: Chengbo Liu, Yong Zhu
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18647v1
摘要: We propose an acceleration scheme for large language models (LLMs) through Speculative Decoding with Semantic Adaptive Tokens (SDSAT). The primary objective of this design is to enhance the LLM model's ability to generate draft tokens more accurately without compromising the model's accuracy. The core strategies involve: 1) Fine-tune the model by incorporating semantic adaptive tokens that possess flexible decoding capabilities without changing its structure, allowing them to generate high-quality draft tokens. 2) By employing a training method that does not affect the standard tokens, the model can acquire parallel decoding abilities atop its original framework with minimal training overhead. 3) We have designed the "two-step-draft-then-verify" generation strategies using both greedy search and nucleus sampling. Experiments conducted on the CodeLlama-13B and 7B models have yielded speed increases of over 3.5X and 3.0X, respectively. Please refer to https://github.com/hasuoshenyun/SDSAT.
PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations
Categories: Robotics
Authors: Ehsan Latif, Ramviyas Parasuraman, Xiaoming Zhai
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.18721v1
摘要: Robot systems in education can leverage Large language models' (LLMs) natural language understanding capabilities to provide assistance and facilitate learning. This paper proposes a multimodal interactive robot (PhysicsAssistant) built on YOLOv8 object detection, cameras, speech recognition, and chatbot using LLM to provide assistance to students' physics labs. We conduct a user study on ten 8th-grade students to empirically evaluate the performance of PhysicsAssistant with a human expert. The Expert rates the assistants' responses to student queries on a 0-4 scale based on Bloom's taxonomy to provide educational support. We have compared the performance of PhysicsAssistant (YOLOv8+GPT-3.5-turbo) with GPT-4 and found that the human expert rating of both systems for factual understanding is the same. However, the rating of GPT-4 for conceptual and procedural knowledge (3 and 3.2 vs 2.2 and 2.6, respectively) is significantly higher than PhysicsAssistant (p < 0.05). However, the response time of GPT-4 is significantly higher than PhysicsAssistant (3.54 vs 1.64 sec, p < 0.05). Hence, despite the relatively lower response quality of PhysicsAssistant than GPT-4, it has shown potential for being used as a real-time lab assistant to provide timely responses and can offload teachers' labor to assist with repetitive tasks. To the best of our knowledge, this is the first attempt to build such an interactive multimodal robotic assistant for K-12 science (physics) education.
Towards LLM-RecSys Alignment with Textual ID Learning
Categories: Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning
Authors: Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, Yongfeng Zhang
Published: 2024-03-27
Link: http://arxiv.org/abs/2403.19021v1
摘要: Generative recommendation based on Large Language Models (LLMs) have transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current research in generative recommendations struggles to effectively encode recommendation items within the text-to-text framework using concise yet meaningful ID representations. To better align LLMs with recommendation needs, we propose IDGen, representing each item as a unique, concise, semantically rich, platform-agnostic textual ID using human language tokens. This is achieved by training a textual ID generator alongside the LLM-based recommender, enabling seamless integration of personalized recommendations into natural language generation. Notably, as user history is expressed in natural language and decoupled from the original dataset, our approach suggests the potential for a foundational generative recommendation model. Experiments show that our framework consistently surpasses existing models in sequential recommendation under standard experimental setting. Then, we explore the possibility of training a foundation recommendation model with the proposed method on data collected from 19 different datasets and tested its recommendation performance on 6 unseen datasets across different platforms under a completely zero-shot setting. The results show that the zero-shot performance of the pre-trained foundation model is comparable to or even better than some traditional recommendation models based on supervised training, showing the potential of the IDGen paradigm serving as the foundation model for generative recommendation. Code and data are open-sourced at https://github.com/agiresearch/IDGenRec.
Accelerating Radio Spectrum Regulation Workflows with Large Language Models (LLMs)
Categories: Networking and Internet Architecture, Artificial Intelligence
Authors: Amir Ghasemi, Paul Guinand
Published: 2024-03-26
Link: http://arxiv.org/abs/2403.17819v1
摘要: Wireless spectrum regulation is a complex and demanding process due to the rapid pace of technological progress, increasing demand for spectrum, and a multitude of stakeholders with potentially conflicting interests, alongside significant economic implications. To navigate this, regulators must engage effectively with all parties, keep pace with global technology trends, conduct technical evaluations, issue licenses in a timely manner, and comply with various legal and policy frameworks. In light of these challenges, this paper demonstrates example applications of Large Language Models (LLMs) to expedite spectrum regulatory processes. We explore various roles that LLMs can play in this context while identifying some of the challenges to address. The paper also offers practical case studies and insights, with appropriate experiments, highlighting the transformative potential of LLMs in spectrum management.
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs
Categories: Computation and Language
Authors: David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, Hinrich Schütze, Leonie Weissweiler
Published: 2024-03-26
Link: http://arxiv.org/abs/2403.17856v1
摘要: Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility -- the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models -- two proprietary models (GPT-3.5 and GPT-4), three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it and that the 7B parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task, as the massive GPT-4.
Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications
Categories: Computation and Language
Authors: Philip Lippmann, Matthijs Spaan, Jie Yang
Published: 2024-03-26
Link: http://arxiv.org/abs/2403.17860v1
摘要: Natural Language Processing (NLP) models optimized for predictive performance often make high confidence errors and suffer from vulnerability to adversarial and out-of-distribution data. Existing work has mainly focused on mitigation of such errors using either humans or an automated approach. In this study, we explore the usage of large language models (LLMs) for data augmentation as a potential solution to the issue of NLP models making wrong predictions with high confidence during classification tasks. We compare the effectiveness of synthetic data generated by LLMs with that of human data obtained via the same procedure. For mitigation, humans or LLMs provide natural language characterizations of high confidence misclassifications to generate synthetic data, which are then used to extend the training set. We conduct an extensive evaluation of our approach on three classification tasks and demonstrate its effectiveness in reducing the number of high confidence misclassifications present in the model, all while maintaining the same level of accuracy. Moreover, we find that the cost gap between humans and LLMs surpasses an order of magnitude, as LLMs attain human-like performance while being more scalable.
MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
Categories: Software Engineering, Artificial Intelligence
Authors: Wei Tao, Yucheng Zhou, Wenqiang Zhang, Yu Cheng
Published: 2024-03-26
Link: http://arxiv.org/abs/2403.17927v1
摘要: In software evolution, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing functionalities. Large Language Models (LLMs) have shown promise in code generation and understanding but face difficulties in code change, particularly at the repository level. To overcome these challenges, we empirically study the reason why LLMs mostly fail to resolve GitHub issues and analyze some impact factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four kinds of agents customized for the software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, which significantly outperforms the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the based LLM of our method. We also analyze the factors for improving GitHub issue resolution rates, such as line location, task allocation, etc.
Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization
Categories: Artificial Intelligence, Computation and Language, Machine Learning
Authors: Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu
Published: 2024-03-26
Link: http://arxiv.org/abs/2403.18120v1
Abstract: Large language models (LLMs), such as Google's Minerva and OpenAI's GPT family, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of an LLM contains sufficiently many examples of formal mathematics (e.g., in Isabelle, a formal theorem-proving environment), it can be prompted to autoformalize informal mathematical statements into formal Isabelle code, which can be automatically verified for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on the GSM8K, MATH, and MultiArith datasets and demonstrate that it provides a consistently better heuristic than vanilla majority voting, the previous best method for identifying correct answers, improving results by more than 12% on GSM8K. In our experiments, it improves results consistently across all datasets and LLM model sizes. The code can be found at https://github.com/jinpz/dtv.
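At selection time, the "verify, then vote" idea above reduces to filtering sampled answers by whether their autoformalized versions pass the proof checker and majority-voting over the survivors. A minimal sketch in which the Isabelle round-trip is replaced by a precomputed stand-in flag:

```python
# Minimal sketch: keep only formally consistent candidates, then majority-vote.
from collections import Counter

def verifies(candidate: dict) -> bool:
    # Stand-in: in the paper this calls Isabelle on the autoformalized statement
    # and solution; here we just read a precomputed flag.
    return candidate["formally_consistent"]

candidates = [
    {"answer": "42", "formally_consistent": True},
    {"answer": "42", "formally_consistent": True},
    {"answer": "17", "formally_consistent": False},   # rejected by the verifier
    {"answer": "41", "formally_consistent": True},
]

kept = [c["answer"] for c in candidates if verifies(c)] or \
       [c["answer"] for c in candidates]               # fall back to plain voting
final_answer = Counter(kept).most_common(1)[0][0]
print(final_answer)   # '42'
```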