http://cs.nju.edu.cn/lwj/conf/CIKM14Hash.htm
Learning to Hash with its Application to Big Data Retrieval and Mining
Overview
Nearest neighbor (NN) search plays a fundamental role in machine learning and related areas such as information retrieval and data mining, so there has been increasing interest in NN search over massive (large-scale) data sets in the big data era. In many real applications, an algorithm does not need to return the exact nearest neighbors for every possible query. Approximate nearest neighbor (ANN) search algorithms, which give up exactness in exchange for higher speed and lower memory use, have therefore received growing attention from researchers in recent years.
Due to its low storage cost and fast query speed, hashing has been widely adopted for ANN search in large-scale datasets. The essential idea of hashing is to map the data points from the original feature space into binary codes in the hashcode space while preserving the similarities between pairs of data points. The binary code representation has two advantages over the original feature vector representation. Firstly, each dimension of a binary code can be stored in only 1 bit, whereas one dimension of the original feature vector typically requires several bytes, which leads to a dramatic reduction in storage cost. Secondly, with the binary code representation, all the data points within a specific Hamming distance of a given query can be retrieved in constant or sub-linear time regardless of the total size of the dataset. Hence, hashing has become one of the most effective methods for big data retrieval and mining.
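The retrieval claim above can be illustrated with a minimal sketch. It assumes binary codes are packed into Python integers and stored in a hash table keyed by the code itself; the function names (`build_table`, `probe`) and the four 8-bit codes are hypothetical and are not taken from the tutorial. The number of probes depends only on the code length and the Hamming radius, not on how many points are stored.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two packed binary codes differ."""
    return bin(a ^ b).count("1")


def build_table(codes: List[int]) -> Dict[int, List[int]]:
    """Map each binary code to the indices of the data points that share it."""
    table: Dict[int, List[int]] = defaultdict(list)
    for idx, code in enumerate(codes):
        table[code].append(idx)
    return table


def probe(table: Dict[int, List[int]], query: int, n_bits: int, radius: int) -> List[int]:
    """Return indices of all stored codes within `radius` bit flips of `query`.

    Every code in the Hamming ball is generated by flipping up to `radius`
    bits of the query and is looked up directly, so the work does not grow
    with the number of stored points.
    """
    hits: List[int] = []
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            candidate = query
            for p in positions:
                candidate ^= 1 << p  # flip bit p
            hits.extend(table.get(candidate, []))
    return sorted(hits)


# Hypothetical 8-bit codes for four data points.
codes = [0b00001111, 0b00001110, 0b11110000, 0b01001111]
table = build_table(codes)
neighbors = probe(table, query=0b00001111, n_bits=8, radius=1)

# Storage comparison under assumed sizes: a 64-bit code versus a
# 512-dimensional float32 feature vector (4 bytes per dimension).
code_bytes = 64 // 8
vector_bytes = 512 * 4
```

Here `neighbors` contains the points whose codes differ from the query in at most one bit, and the assumed sizes give 8 bytes per code against 2048 bytes per feature vector. For larger radii the number of probes grows quickly with the code length, which is why this kind of lookup is practical mainly for short codes and small radii.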
To obtain effective hash codes, most methods adopt machine learning techniques to learn the hashing functions. Hence, learning to hash, which aims to design effective machine learning methods for hashing, has recently become a very active research topic with wide applications in many big data areas. This tutorial will provide a systematic introduction to learning to hash, including the motivation, models, learning algorithms, and applications. Firstly, we will introduce the challenges that arise when performing retrieval and mining with big data, which motivate the adoption of hashing. Secondly, we will give comprehensive coverage of the foundations and recent developments in learning to hash, including unsupervised hashing, supervised hashing, multimodal hashing, etc. Thirdly, we will present quantization methods, which many hashing methods use to turn real values into binary codes. Fourthly, we will introduce a wide variety of applications of hashing, including image retrieval, cross-modal retrieval, recommender systems, and so on.
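As a rough illustration of the quantization step mentioned above, the sketch below turns real-valued projections into binary codes by thresholding each dimension at its mean, one bit per dimension. This is only the simplest possible scheme, chosen here as an assumption for illustration; it is not any specific method covered by the tutorial, and the function name and sample values are hypothetical.

```python
from typing import List


def quantize_single_bit(projections: List[List[float]]) -> List[List[int]]:
    """Quantize real-valued projections into one bit per dimension.

    Each dimension is thresholded at its mean over all data points:
    values above the mean become 1, all others become 0.
    """
    n_points = len(projections)
    n_dims = len(projections[0])
    means = [sum(row[d] for row in projections) / n_points for d in range(n_dims)]
    return [
        [1 if row[d] > means[d] else 0 for d in range(n_dims)]
        for row in projections
    ]


# Hypothetical real-valued projections of four data points onto three dimensions.
projections = [
    [0.9, -1.2, 0.3],
    [1.1, -0.8, -0.5],
    [-1.0, 1.0, 0.4],
    [-0.6, 1.4, -0.2],
]
codes = quantize_single_bit(projections)
```

In this toy data the first two points and the last two points agree on their first two bits, so points that are close in the projected space end up close in Hamming distance. Values that lie near a threshold can still be split into different bits, which is one reason more refined quantization schemes are studied.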
Slides & Outline
TBD (to be determined)
Presenter
Wu-Jun Li
Dr. Wu-Jun Li is currently an associate professor in the Department of Computer Science and Technology at Nanjing University, P. R. China. From 2010 to 2013, he was a faculty member of the Department of Computer Science and Engineering at Shanghai Jiao Tong University, P. R. China. He received his PhD degree from the Department of Computer Science and Engineering at Hong Kong University of Science and Technology in 2010. Before that, he received his M.Eng. degree and B.Sc. degree from the Department of Computer Science and Technology, Nanjing University, in 2006 and 2003, respectively. His main research interests include machine learning and pattern recognition, especially statistical relational learning and big data machine learning (big learning). In these areas he has published more than 30 peer-reviewed papers, most of them in prestigious journals such as TKDE and top conferences such as AAAI, AISTATS, CVPR, ICML, IJCAI, NIPS, and SIGIR. He has served as a PC member of ICML'14, IJCAI'13/'11, NIPS'14, SDM'14, UAI'14, etc.