Research on Lexical Semantic Similarity Measurement Based on Knowledge Integration

Chinese Abstract

With the advent of the big data era, massive volumes of text data provide high-value information while also posing severe challenges to textual semantic understanding. Words are the smallest constituent units of text, and their semantic similarity is an important basis for mining lexical associations, helping computers accurately understand the content of sentences and documents. According to the lexical semantic resources used, typical semantic similarity measurement methods fall into two classes: knowledge base-based and corpus-based. Knowledge bases provide semantic descriptions and structured information about words, but they depend heavily on domain experts for construction and maintenance, have low lexical coverage, and lack extensibility. Corpora, in contrast, contain rich vocabularies, but their unstructured nature makes it difficult to extract effective lexical semantic features from them. To overcome the shortcomings of single-type semantic resources, this dissertation builds on the graph structure of WordNet and on low-dimensional vector representations of words, and studies the integration of semantic knowledge from knowledge bases and corpora from three perspectives: the quantification model of concept information content, semantically augmented word vectors, and the optimized combination of measurement methods. The main contributions of this dissertation are as follows:

(1) A concept semantic similarity measurement based on IC-weighted shortest path, CSSM-ICSP (Concept Semantic Similarity Measurement Based on IC-weighted Shortest Path), is proposed. The method uses structural properties of concepts in WordNet, such as edge length, depth and density, together with information content (IC), to compute the path distance between concepts and nonlinearly transform it into concept semantic similarity. First, concept density is smoothed by a function of concept depth to construct a WordNet-based Intrinsic IC Hybrid (IIH) model, which remedies the neglect of concept depth in traditional IC models. Second, the IC difference between concepts is used as an edge weight to measure the varying strength of semantic relations between concepts at different depths. A concept distance model is then built from the IC-weighted path distance, the depth difference ratio and the normalized path distance. In addition, to integrate the semantic knowledge of WordNet and corpora, the method introduces a corpus-based statistical IC model into the intrinsic IC model. Experimental results on the public word-pair similarity benchmarks M&C, R&G, WS-353 and WS-sim show that, compared with other WordNet-based methods, the proposed method achieves a higher Pearson linear correlation coefficient.

(2) A word semantic similarity measurement based on multiple semantic fusion, WSSM-MSF (Word Semantic Similarity Measurement Based on Multiple Semantic Fusion), is proposed. The method aims to construct effective lexical semantic representations and thereby improve vector space-based semantic similarity measurement. Since the semantic content of a document can be represented as a vector composition of its sentences, phrases or words, the method applies algebraic operations on vectors over multiple semantic properties of WordNet concepts, including synsets, glosses, hypernyms and hyponyms, to construct a Multiple Semantic Fusion (MSF) model, which generates concept vectors and semantically augmented word vectors and integrates heterogeneous knowledge on the basis of semantic features. To avoid the data sparsity and high-dimensional features caused by traditional bag-of-words models, the method adopts the Continuous Bag-of-Words (CBOW) model to learn low-dimensional, dense, real-valued word vectors from large-scale text corpora. Experimental results show that the proposed semantically augmented word vectors represent semantic features better than the original word vectors, improving the accuracy of word-pair similarity evaluation as well as the precision and recall of semantic Web service matching.

(3) A word semantic similarity measurement based on the Differential Evolution (DE) algorithm, WSSM-DE (Word Semantic Similarity Measurement Based on Differential Evolution), is proposed. The method casts the optimized combination of multiple measurement methods as a stochastic optimization process in a solution space: the semantic similarities computed from WordNet or from low-dimensional vectors serve as the multi-dimensional components of individuals in the DE population, and a heuristic global search based on individual differences yields the weight of each component and the optimal solution, thereby integrating the semantic knowledge of WordNet and corpora. Based on the changes of each component value of the optimal individual, the space to which word vectors may belong in semantic computation tasks is analyzed. Experimental results on word-pair similarity evaluation show that the proposed method outperforms not only similarity measurements based on a single semantic source but also supervised combination methods, including learning-to-rank-based and regression-based methods. In particular, when the semantically augmented word vectors are applied in this method, the accuracy of semantic similarity computation improves markedly.

In summary, compared with existing similarity measurement methods based on a single type of resource, the three proposed methods all focus on integrating semantic information from heterogeneous resources to improve the performance of lexical semantic similarity measurement. Their applicability depends on the type and scale of the available semantic resources and on the evaluation task.

Keywords: knowledge integration; semantic similarity measurement; IC quantification model; semantic augmentation; low-dimensional word vectors; differential evolution
Classification number: TP391
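The following is a minimal sketch of how an IC-weighted shortest-path similarity of the kind described in contribution (1) might be computed over NLTK's WordNet interface. The intrinsic IC formula, the exponential distance-to-similarity mapping, the constant for the synset count and all function names are illustrative assumptions, not the dissertation's actual CSSM-ICSP model.

```python
# Illustrative sketch only: IC-weighted path similarity over WordNet (NLTK).
import math
from nltk.corpus import wordnet as wn

def intrinsic_ic(synset, max_nodes=117659):
    """Seco-style intrinsic IC from the hyponym count; the depth smoothing
    of the IIH model is omitted here. max_nodes is an assumed constant."""
    hypo = len(list(synset.closure(lambda s: s.hyponyms())))
    return 1.0 - math.log(hypo + 1) / math.log(max_nodes)

def path_cost(synset, ancestor):
    """Sum |IC(parent) - IC(child)| along a hypernym path from synset up to
    ancestor, so that edges deeper in the hierarchy contribute less."""
    for path in synset.hypernym_paths():          # each path: root ... synset
        if ancestor in path:
            seg = path[path.index(ancestor):]     # ancestor ... synset
            return sum(abs(intrinsic_ic(a) - intrinsic_ic(b))
                       for a, b in zip(seg, seg[1:]))
    return float('inf')

def ic_weighted_distance(s1, s2):
    """IC-weighted distance through the lowest common subsumer."""
    lcs = s1.lowest_common_hypernyms(s2)[0]
    return path_cost(s1, lcs) + path_cost(s2, lcs)

def cssm_icsp_similarity(w1, w2):
    """Nonlinearly map the smallest IC-weighted distance over all noun
    synset pairs to a similarity in (0, 1] (assumed exponential decay)."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            best = max(best, math.exp(-ic_weighted_distance(s1, s2)))
    return best

print(cssm_icsp_similarity('car', 'automobile'))   # close to 1.0 for synonyms
```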

English Abstract

With the advent of the era of big data, massive amounts of textual data provide valuable information but also pose tough challenges for the semantic understanding of text. Words are the basic units of text, so lexical semantic similarity measurement plays an important role in mining word associations and enables computers to understand sentences and documents accurately. In terms of the lexical semantic resources used, semantic similarity measurement methods are mainly classified into knowledge base-based and corpus-based. Knowledge bases provide lexical semantic descriptions and structured information; however, they depend heavily on domain experts for construction and maintenance, and they suffer from low lexical coverage and limited extensibility. A corpus, in contrast, commonly contains a copious vocabulary, but it is hard to extract effective semantic features for representing words from its unstructured text. To overcome the limitations of single-resource similarity measurement, this dissertation focuses on the graph structure of WordNet and low-dimensional word vector representations, and studies how to integrate the semantic knowledge derived from knowledge bases and corpora in terms of the IC computational model, semantically augmented word vectors and the combinational optimization of measurement methods. The main contributions of this dissertation are as follows:

(1) It presents a concept semantic similarity measurement based on the IC-weighted shortest path (CSSM-ICSP), which nonlinearly transforms the path distance between concepts into semantic similarity. The method exploits structural properties of concepts such as edge length, depth and density, as well as the information content (IC) of concepts. First, we build an Intrinsic IC Hybrid (IIH) model that smooths concept density with a depth-related nonlinear function, addressing the neglect of concept depth in traditional IC computational models. Second, each edge between concepts is weighted by the difference of their IC values, reflecting the non-uniform strength of semantic relationships between concepts at different depths of the hierarchy. We then combine the IC-weighted path distance, the depth difference and the normalized path distance into a new concept distance model. In addition, we introduce a hybrid of intrinsic and corpus-based statistical IC values into the similarity measurement, which integrates the semantic knowledge of WordNet and the corpus. Experiments are conducted on the public benchmark datasets M&C, R&G, WS-353 and WS-sim. The results show that, compared with other WordNet-based measurement methods, the proposed method achieves a higher Pearson correlation coefficient.

(2) It presents a word semantic similarity measurement based on multiple semantic fusion (WSSM-MSF), which aims to improve vector space-based measurement by means of effective lexical semantic representation. Since the semantic content of a document can be represented by composing the vectors of its sentences, phrases or words, we build a multiple semantic fusion (MSF) model based on algebraic operations over the vectors of multiple semantic properties in WordNet, including synsets, glosses, hypernyms and hyponyms. In this way, the MSF model generates concept vectors and semantically augmented word vectors and integrates heterogeneous knowledge on the basis of semantic features.
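Below is a minimal sketch of MSF-style vector fusion under these assumptions: a concept vector is assembled by averaging the embeddings of words drawn from the synset, gloss, hypernyms and hyponyms, and is then blended with the original corpus word vector. The uniform averaging, the mixing weight alpha, the vector file name and all function names are illustrative, not the dissertation's exact MSF model.

```python
# Illustrative sketch only: fusing WordNet semantic properties into a
# semantically augmented word vector on top of CBOW embeddings.
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

# Hypothetical file of CBOW vectors trained on a large corpus.
wv = KeyedVectors.load_word2vec_format('cbow_vectors.bin', binary=True)

def avg_vector(words):
    """Mean embedding of the in-vocabulary words; zero vector if none."""
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def concept_vector(synset):
    """Fuse the four semantic properties of a WordNet concept by averaging."""
    synonyms  = synset.lemma_names()
    gloss     = synset.definition().split()
    hypernyms = [l for h in synset.hypernyms() for l in h.lemma_names()]
    hyponyms  = [l for h in synset.hyponyms()  for l in h.lemma_names()]
    parts = [avg_vector(p) for p in (synonyms, gloss, hypernyms, hyponyms)]
    return np.mean(parts, axis=0)

def semantic_augmented_vector(word, alpha=0.5):
    """Blend the corpus word vector with the mean of its concept vectors
    (alpha is an assumed mixing weight)."""
    base = wv[word] if word in wv else np.zeros(wv.vector_size)
    concepts = [concept_vector(s) for s in wn.synsets(word)]
    if not concepts:
        return base
    return alpha * base + (1 - alpha) * np.mean(concepts, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

print(cosine(semantic_augmented_vector('car'),
             semantic_augmented_vector('journey')))
```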
To avoid data sparsity and high-dimensional features, we use the neural-network-based Continuous Bag-of-Words (CBOW) model to learn low-dimensional, dense, real-valued word embeddings from a large-scale corpus. Experimental results show that the semantically augmented word vectors express semantic features better than the original word vectors and improve performance in both word similarity evaluation and semantics-oriented Web service matching, in terms of accuracy, precision and recall.

(3) It presents a word semantic similarity measurement based on differential evolution (WSSM-DE), which treats the optimized combination of multiple measurement methods as a stochastic optimization process in a solution space. Each dimension of an individual in the DE population corresponds to a WordNet-based or low-dimensional word-vector-based similarity score. The optimal weight of each dimension and the optimum solution are then obtained by a heuristic global search based on individual differences, which integrates the semantic knowledge of WordNet and the corpus. We further analyze the spaces in which word vectors may lie, based on the changes of the weighting values of each dimension of the optimal individual. Experimental results on word similarity evaluation show that the proposed method outperforms not only methods based on a single type of semantic resource but also supervised combination methods, including learning to rank (LTR) and regression. In particular, combining the proposed method with the semantically augmented word vectors yields a significant improvement in measurement accuracy.

In summary, compared with existing methods, the three proposed semantic similarity measurement methods focus on integrating semantic information derived from heterogeneous resources to improve lexical semantic similarity measurement. Their applicability depends on the type and scale of the available semantic resources as well as on the evaluation task.

KEYWORDS: Knowledge integration; Semantic similarity measurement; IC quantitative model; Semantic augmentation; Low-dimensional word embedding; Differential evolution
CLASSNO: TP391
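As a reference point for the kind of combination search described in contribution (3), the following is a minimal sketch of weighting two similarity measures with differential evolution: DE searches for non-negative weights that maximize the Pearson correlation with human ratings on a tuning set. The toy scores, the objective and the use of SciPy's differential_evolution are illustrative assumptions rather than the dissertation's WSSM-DE implementation.

```python
# Illustrative sketch only: DE-based weighting of multiple similarity measures.
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import pearsonr

# scores[i, j] = score of word pair i under measure j (e.g. a WordNet-based
# measure and a word-vector cosine); gold[i] = human similarity rating.
# These values are placeholders, not real benchmark data.
scores = np.array([[0.9, 0.8],
                   [0.2, 0.4],
                   [0.6, 0.7],
                   [0.1, 0.3]])
gold   = np.array([9.1, 1.5, 6.8, 0.9])

def neg_pearson(weights):
    """Objective: negative Pearson correlation of the weighted combination."""
    combined = scores @ weights
    r, _ = pearsonr(combined, gold)
    return -r

result = differential_evolution(neg_pearson,
                                bounds=[(0.0, 1.0)] * scores.shape[1],
                                seed=42)
print('weights:', result.x, 'pearson:', -result.fun)
```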
