Publications

Projects

    Outline

  1. Label Distribution Learning
  2. Personal Visual Attribute Recognition
  3. Learning with Weak Supervision
  4. Multiple Kernel Learning and Extensions Based on Generalized Kernels
  5. Open Information Extraction
  6. Unequal Data Learning Based on Imprecise Information

Label Distribution Learning

In a given application, the different meanings of an ambiguous object often differ in importance, a phenomenon that the current multi-label learning framework cannot capture well. To address this, label distribution learning is proposed as a new learning framework in which an instance is labeled not by a label set but by a label distribution. Each label in the distribution is assigned a real number, called its description degree, which represents the importance of that label. Compared with multi-label learning, label distribution learning is more general, flexible, and expressive; it is a fresh attempt to address the ambiguity problem in learning. Moreover, both single-label and multi-label learning can be viewed as special cases of label distribution learning, so research on label distribution learning may also help solve problems in those settings. This project is one of the first explorations of label distribution learning, and will study its theory, algorithms, and applications, as well as its use in learning from crowds and in exploiting label correlations. The project aims to establish a basic theoretical framework for label distribution learning, propose several algorithms, and demonstrate their value in real applications.
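As a minimal illustration of the framework (not the project's actual algorithm), a label distribution model can be fit by minimizing the KL divergence between predicted and target distributions using a softmax (maximum-entropy) model; the function names and toy data below are hypothetical:

```python
import numpy as np

def fit_ldl(X, D, lr=0.1, epochs=500):
    """Fit a maximum-entropy label distribution model by gradient descent
    on the KL divergence between predicted and target distributions.
    X: (n, d) features; D: (n, k) label distributions (rows sum to 1)."""
    n, d = X.shape
    k = D.shape[1]
    W = np.zeros((d, k))
    for _ in range(epochs):
        scores = X @ W
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)             # softmax predictions
        grad = X.T @ (P - D) / n                      # gradient of KL loss w.r.t. W
        W -= lr * grad
    return W

def predict_ldl(W, X):
    """Predict a label distribution (one row per instance)."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    return P / P.sum(axis=1, keepdims=True)
```

Note that the output is a full distribution over labels, not a label set: the predicted description degrees are nonnegative and sum to one for each instance.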

Personal Visual Attribute Recognition

1. Age Estimation

2. Face Recognition

3. Head Pose Estimation

Learning with Weak Supervision

Supervision information encodes the semantics and regularities of the learning problem to be addressed, and thus plays a key role in the success of many learning systems. Traditional supervised learning methods adopt the strong supervision assumption, i.e., training examples are supposed to carry sufficient and explicit supervision information to induce prediction models with good generalization ability. However, due to constraints imposed by the physical environment, problem characteristics, and resource limitations, strong supervision information is difficult or even infeasible to obtain in many real-world applications.

Currently, we are studying several machine learning frameworks that learn from various kinds of weakly-supervised information, including semi-supervised learning, multi-label learning, and partial label learning. This series of work is supported by the National Natural Science Foundation of China (NSFC), the Ministry of Education of China, etc.

Multiple Kernel Learning and Extensions Based on Generalized Kernels

The kernel method is a powerful statistical learning technique that has been widely applied in many research fields, such as classification, regression, and clustering. Kernel selection is central to kernel methods, as it is crucial for improving their generalization performance. Multiple kernel learning uses a combination of multiple base kernels instead of a single one, converting the kernel selection problem into the choice of combination coefficients and thereby improving kernel methods effectively. However, most multiple kernel learning methods are confined to convex combinations of positive definite kernels, which limits their performance and applicability. Generalized kernels are an emerging kernel class that drops the positive-definiteness constraint and shows much better empirical classification results; however, existing generalized kernel methods are based on a single kernel, which also limits their performance. Combining generalized kernels with multiple kernel learning is therefore a new way to improve the generalization performance of kernel methods. However, the non-convexity of generalized kernels and the many parameters of multiple kernel learning make this combination non-trivial. This project aims to develop a general framework for multiple kernel classification based on generalized kernels and to design a series of effective algorithms that take full account of the characteristics of both kinds of kernel methods, in order to overcome their existing deficiencies. The research focuses on model construction, non-convex optimization, generalization analysis, and experimental comparison, and will further extend the framework to other learning and application fields.
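To make the combination idea concrete, the sketch below builds a convex combination of base RBF kernels, with weights chosen by a simple kernel–target alignment heuristic. This is an illustrative stand-in for multiple kernel learning, not the project's proposed generalized-kernel method, and all names are assumptions:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def alignment_weights(kernels, y):
    """Heuristic MKL: score each base kernel by its alignment with the
    target kernel y y^T, then normalize to a convex combination."""
    yyT = np.outer(y, y)
    a = np.array([(K * yyT).sum() / np.linalg.norm(K) for K in kernels])
    a = np.clip(a, 0.0, None)      # keep coefficients nonnegative
    return a / a.sum()

def combined_kernel(kernels, w):
    """Convex combination of base kernel matrices."""
    return sum(wi * K for wi, K in zip(w, kernels))
```

The learned coefficients replace manual kernel selection: a downstream kernel classifier (e.g., an SVM) would simply consume the combined kernel matrix.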

Open Information Extraction

Knowledge discovery is the process of finding valid knowledge in databases. Web-based knowledge discovery uses the entire Web as its information source to obtain understandable patterns. Information extraction (IE), a key step in Web-based knowledge discovery, processes the information embedded in text and outputs data organized in a given form.

Imagine that when we type "the relations between North and South Korea" or "the stock market" into a search engine, we are shown, clearly and directly, a timeline of changes in inter-Korean relations or a chart of the Nasdaq stock market, instead of billions of Web pages. Progress in IE is essential to achieving this goal. Conventional information extraction focuses on predefined relations, such as flight take-off and landing information; it started with named entity recognition (NER) and now targets complex relation extraction. Open IE is not restricted to a concrete extraction task, so it overcomes the requirement of conventional IE that relations be defined in advance, and can extract relations without such restrictions. Open IE is clearly better suited to the Web environment, because a system cannot predict in advance which relations users will care about.

The Open IE framework we propose contains four modules: a training-set preparation module, a statistical-model construction module based on abstract annotations and weakly supervised learning, an IE module, and a verification and inference module. First, Web text is matched and filtered against relation entities in the semantic database, entities and relations present in the ontology, relation entities in the professional knowledge database, and named entities in Wikipedia. The filtered sentences are then clustered with an improved k-nearest-neighbor algorithm, and each resulting cluster is given an abstract annotation describing the semantic information shared by its sentences. The clusters and their abstract annotations together form a training set, from which a statistical model is built by a learning algorithm. Based on this model, the extractor performs Open IE on the Web text to be processed; the extracted information is output by the system and integrated with the knowledge sources, with new entity relations inserted into the ontology, the semantic database, and the knowledge base. Finally, the extracted information is validated against the knowledge sources, and the updated knowledge sources serve, together with new Web text, as input for the next round.
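The first two stages, entity-based filtering and grouping sentences under a shared annotation, can be caricatured as follows. The entity list, helper names, and the use of the text between two matched entities as a cluster label are all illustrative simplifications of the actual modules:

```python
# Hypothetical sketch of the filtering and clustering stages: keep Web
# sentences that mention known entities, then group the survivors by the
# text between the two matched entities (a crude stand-in for the improved
# k-nearest-neighbor clustering plus abstract annotation).
KNOWN_ENTITIES = {"Paris", "France", "Berlin", "Germany"}

def filter_sentences(sentences):
    """Keep only sentences mentioning at least two known entities."""
    out = []
    for s in sentences:
        hits = [w.strip(".,") for w in s.split() if w.strip(".,") in KNOWN_ENTITIES]
        if len(hits) >= 2:
            out.append((s, hits[0], hits[1]))
    return out

def group_by_relation(filtered):
    """Cluster filtered sentences by the text between the two entities;
    the shared text serves as the cluster's abstract annotation."""
    clusters = {}
    for s, e1, e2 in filtered:
        between = s.split(e1, 1)[1].split(e2, 1)[0].strip()
        clusters.setdefault(between, []).append((e1, e2))
    return clusters
```

In the real system, each cluster and its abstract annotation would then become training data for the weakly supervised statistical model.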

Unequal Data Learning Based on Imprecise Information

In real-world applications, misclassification costs or data distributions are often unequal, which violates the assumptions of standard machine learning algorithms.

Unequal misclassification costs. Misclassification errors often have different costs. For example, in medical diagnosis, wrongly predicting that a cancer patient is healthy costs much more than wrongly predicting that a healthy person has cancer, because the former may cost a life. Taking unequal costs into account, cost-sensitive learning aims to minimize the total cost instead of the error rate.
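As a small sketch of the cost-sensitive decision rule (the cost values and names below are hypothetical), prediction picks the class with minimum expected cost rather than maximum probability:

```python
import numpy as np

# Hypothetical cost matrix for the diagnosis example: rows are true classes
# (0 = healthy, 1 = cancer), columns are predicted classes. Missing a cancer
# (row 1, col 0) is assumed far more expensive than a false alarm.
COST = np.array([[0.0, 1.0],
                 [50.0, 0.0]])

def min_cost_predict(probs, cost=COST):
    """Pick, per example, the class with minimum expected cost.
    probs: (n, k) class-probability estimates."""
    expected = probs @ cost          # (n, k): expected cost of each prediction
    return expected.argmin(axis=1)
```

With these assumed costs, an example with only a 10% cancer probability is still predicted as cancer, because the expected cost of missing it (0.1 × 50 = 5.0) exceeds that of a false alarm (0.9 × 1 = 0.9).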

Unequal data distribution. Some classes have far fewer examples than others, and the minority-class examples are more important. For example, in face detection, an image contains many windows of different sizes at different locations, and the task is to determine which windows contain a face. There are typically very few face windows; non-face windows can outnumber them by a factor of 10,000. Obviously, the minority-class data, the face windows, are more important, so accuracy is no longer an appropriate evaluation measure: in this example, the trivial classifier that always predicts non-face achieves 99.99% accuracy but is useless. When data distributions are imbalanced, class-imbalance learning assumes the minority class is more important and uses measures such as F-measure, AUC, and G-mean for evaluation.
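The evaluation point can be made concrete with a minimal sketch: on toy data with one positive among 100 examples, the trivial all-negative classifier reaches 99% accuracy, while F-measure and G-mean both drop to zero (function names here are illustrative):

```python
import math

def confusion_counts(y_true, y_pred):
    """Binary confusion counts; class 1 is the minority (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall on the positive class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)
```

Unlike accuracy, both measures reward performance on the minority class, so the trivial classifier scores zero.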

Current machine learning algorithms assume that costs or data distributions can be precisely described, but many factors make them imprecise. Currently, we are studying learning from imprecise costs and from imprecise data distributions caused by the ambiguity of multiple labels. This series of work is supported by the National Natural Science Foundation of China (NSFC), the Ministry of Education of China, etc.

Resources