Solr 4.8.0 Source Code Analysis (6): Non-Sorted Queries


2024-07-18 20:59:25, 225 views


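The previous article gave a brief overview of Solr's query flow; this one starts looking at the details. Queries fall into two groups, sorted and non-sorted. The two take different branches in the code, so this article covers the non-sorted branch first.

The main flow lives in SolrIndexSearcher.getDocListC(QueryResult qr, QueryCommand cmd). As the name suggests, this function handles the queryResultCache and then, depending on the query, chooses the sorted or the non-sorted branch.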

  /**
   * getDocList version that uses+populates query and filter caches.
   * In the event of a timeout, the cache is not populated.
   */
  private void getDocListC(QueryResult qr, QueryCommand cmd) throws IOException {
    DocListAndSet out = new DocListAndSet();
    qr.setDocListAndSet(out);
    QueryResultKey key=null;
    // For a query with an offset, Solr first collects cmd.getOffset() + cmd.getLen() doc ids
    // and then takes the requested slice, so maxDocRequested is the number actually fetched.
    int maxDocRequested = cmd.getOffset() + cmd.getLen();
    // check for overflow, and check for # docs in index
    if (maxDocRequested < 0 || maxDocRequested > maxDoc()) maxDocRequested = maxDoc(); // worst case: fetch every doc id
    int supersetMaxDoc= maxDocRequested;
    DocList superset = null;

    int flags = cmd.getFlags();
    Query q = cmd.getQuery();
    if (q instanceof ExtendedQuery) {
      ExtendedQuery eq = (ExtendedQuery)q;
      if (!eq.getCache()) {
        flags |= (NO_CHECK_QCACHE | NO_SET_QCACHE | NO_CHECK_FILTERCACHE);
      }
    }


    // we can try and look up the complete query in the cache.
    // we can't do that if filter!=null though (we don't want to
    // do hashCode() and equals() for a big DocSet).
    // First check the query result cache for an earlier run of the same query and, if present,
    // return the cached result. Caching will get an article of its own.
    if (queryResultCache != null && cmd.getFilter()==null
        && (flags & (NO_CHECK_QCACHE|NO_SET_QCACHE)) != ((NO_CHECK_QCACHE|NO_SET_QCACHE)))
    {
        // all of the current flags can be reused during warming,
        // so set all of them on the cache key.
        key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags);
        if ((flags & NO_CHECK_QCACHE)==0) {
          superset = queryResultCache.get(key);

          if (superset != null) {
            // check that the cache entry has scores recorded if we need them
            if ((flags & GET_SCORES)==0 || superset.hasScores()) {
              // NOTE: subset() returns null if the DocList has fewer docs than
              // requested
              out.docList = superset.subset(cmd.getOffset(),cmd.getLen()); // cache hit: take the requested slice
            }
          }
          if (out.docList != null) {
            // found the docList in the cache... now check if we need the docset too.
            // OPT: possible future optimization - if the doclist contains all the matches,
            // use it to make the docset instead of rerunning the query.
            // Fetch the DocSet as well and attach it to the result.
            if (out.docSet==null && ((flags & GET_DOCSET)!=0) ) {
              if (cmd.getFilterList()==null) {
                out.docSet = getDocSet(cmd.getQuery());
              } else {
                List<Query> newList = new ArrayList<>(cmd.getFilterList().size()+1);
                newList.add(cmd.getQuery());
                newList.addAll(cmd.getFilterList());
                out.docSet = getDocSet(newList);
              }
            }
            return;
          }
        }

      // If we are going to generate the result, bump up to the
      // next resultWindowSize for better caching.
      // Round supersetMaxDoc up to a multiple of queryResultWindowSize.
      if ((flags & NO_SET_QCACHE) == 0) {
        // handle 0 special case as well as avoid idiv in the common case.
        if (maxDocRequested < queryResultWindowSize) {
          supersetMaxDoc=queryResultWindowSize;
        } else {
          supersetMaxDoc = ((maxDocRequested -1)/queryResultWindowSize + 1)*queryResultWindowSize;
          if (supersetMaxDoc < 0) supersetMaxDoc=maxDocRequested;
        }
      } else {
        key = null;  // we won't be caching the result
      }
    }
    cmd.setSupersetMaxDoc(supersetMaxDoc);


    // OK, so now we need to generate an answer.
    // One way to do that would be to check if we have an unordered list
    // of results for the base query.  If so, we can apply the filters and then
    // sort by the resulting set.  This can only be used if:
    // - the sort doesn't contain score
    // - we don't want score returned.

    // check if we should try and use the filter cache
    boolean useFilterCache=false;
    if ((flags & (GET_SCORES|NO_CHECK_FILTERCACHE))==0 && useFilterForSortedQuery && cmd.getSort() != null && filterCache != null) {
      useFilterCache=true;
      SortField[] sfields = cmd.getSort().getSort();
      for (SortField sf : sfields) {
        if (sf.getType() == SortField.Type.SCORE) {
          useFilterCache=false;
          break;
        }
      }
    }

    if (useFilterCache) {
      // now actually use the filter cache.
      // for large filters that match few documents, this may be
      // slower than simply re-executing the query.
      if (out.docSet == null) {
        out.docSet = getDocSet(cmd.getQuery(),cmd.getFilter());
        DocSet bigFilt = getDocSet(cmd.getFilterList());
        if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
      }
      // todo: there could be a sortDocSet that could take a list of
      // the filters instead of anding them first...
      // perhaps there should be a multi-docset-iterator
      sortDocSet(qr, cmd);  // sorted-query branch
    } else {
      // do it the normal way...
      if ((flags & GET_DOCSET)!=0) {
        // this currently conflates returning the docset for the base query vs
        // the base query and all filters.
        DocSet qDocSet = getDocListAndSetNC(qr,cmd);
        // cache the docSet matching the query w/o filtering
        if (qDocSet!=null && filterCache!=null && !qr.isPartialResults()) filterCache.put(cmd.getQuery(),qDocSet);
      } else {
        getDocListNC(qr,cmd); // non-sorted branch, the path this article follows
      }
      assert null != out.docList : "docList is null";
    }

    if (null == cmd.getCursorMark()) {
      // Kludge...
      // we can't use DocSlice.subset, even though it should be an identity op
      // because it gets confused by situations where there are lots of matches, but
      // less docs in the slice then were requested, (due to the cursor)
      // so we have to short circuit the call.
      // None of which is really a problem since we can't use caching with
      // cursors anyway, but it still looks weird to have to special case this
      // behavior based on this condition - hence the long explanation.
      superset = out.docList; // slice the result by offset and len
      out.docList = superset.subset(cmd.getOffset(),cmd.getLen());
    } else {
      // sanity check our cursor assumptions
      assert null == superset : "cursor: superset isn't null";
      assert 0 == cmd.getOffset() : "cursor: command offset mismatch";
      assert 0 == out.docList.offset() : "cursor: docList offset mismatch";
      assert cmd.getLen() >= supersetMaxDoc : "cursor: superset len mismatch: " +
        cmd.getLen() + " vs " + supersetMaxDoc;
    }

    // lastly, put the superset in the cache if the size is less than or equal
    // to queryResultMaxDocsCached
    if (key != null && superset.size() <= queryResultMaxDocsCached && !qr.isPartialResults()) {
      queryResultCache.put(key, superset);    // cache this result only if its size is <= queryResultMaxDocsCached
    }
  }
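The window arithmetic in getDocListC is easy to misread, so here is a minimal, self-contained sketch of just that part. The class and method names (SupersetWindow, maxDocRequested, supersetMaxDoc) are made up for this illustration and are not Solr API; the sketch also assumes windowSize is at least 1.

```java
// Illustrative only: mirrors the offset/len and window-rounding arithmetic
// shown in getDocListC. Names are hypothetical, not part of Solr.
final class SupersetWindow {

    /** offset + len, clamped to maxDoc when the sum overflows or exceeds the index size. */
    static int maxDocRequested(int offset, int len, int maxDoc) {
        int requested = offset + len;            // may wrap around to a negative value
        if (requested < 0 || requested > maxDoc) {
            requested = maxDoc;
        }
        return requested;
    }

    /** Rounds the requested count up to the next multiple of windowSize (assumed >= 1). */
    static int supersetMaxDoc(int maxDocRequested, int windowSize) {
        if (maxDocRequested < windowSize) {
            return windowSize;                   // also covers maxDocRequested == 0
        }
        int rounded = ((maxDocRequested - 1) / windowSize + 1) * windowSize;
        return rounded < 0 ? maxDocRequested : rounded;  // fall back if rounding overflowed
    }
}
```

With a window size of 20, a request for rows 20 to 24 (offset 20, len 5) asks for 25 docs and is rounded up to 40, so the following page can be served from the same cache entry.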

Now into the non-sorted branch, getDocListNC(). Part of this function calls straight into Lucene's IndexSearcher.search().

      // Build the TopDocsCollector. Internally it creates a HitQueue of size (offset + len), where len
      // is the len of the query. Every doc that matches is offered to the HitQueue and totalHits is
      // incremented, so totalHits ends up as the number of matching documents.
      final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
      Collector collector = topCollector;
      if (terminateEarly) {
        collector = new EarlyTerminatingCollector(collector, cmd.len);
      }
      if( timeAllowed > 0 ) {
        collector = new TimeLimitingCollector(collector, TimeLimitingCollector.getGlobalCounter(), timeAllowed);
        // TimeLimitingCollector works in a simple way: the clock starts once collection starts, and
        // until timeAllowed has elapsed matching doc ids keep going into the HitQueue. Once
        // timeAllowed has passed, the next collect() call throws and the rest of the search is
        // abandoned. This is a useful hint when tuning queries.
      }
      if (pf.postFilter != null) {
        pf.postFilter.setLastDelegate(collector);
        collector = pf.postFilter;
      }
      try {
        // enter Lucene's IndexSearcher.search()
        super.search(query, luceneFilter, collector);
        if(collector instanceof DelegatingCollector) {
          ((DelegatingCollector)collector).finish();
        }
      }
      catch( TimeLimitingCollector.TimeExceededException x ) {
        log.warn( "Query: " + query + "; " + x.getMessage() );
        qr.setPartialResults(true);
      }

      totalHits = topCollector.getTotalHits();           // total number of hits
      TopDocs topDocs = topCollector.topDocs(0, len);    // doc ids held in the HitQueue priority queue
      populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);

      maxScore = totalHits>0 ? topDocs.getMaxScore() : 0.0f;
      nDocsReturned = topDocs.scoreDocs.length;
      ids = new int[nDocsReturned];
      scores = (cmd.getFlags()&GET_SCORES)!=0 ? new float[nDocsReturned] : null;
      for (int i=0; i<nDocsReturned; i++) {
        ScoreDoc scoreDoc = topDocs.scoreDocs[i];
        ids[i] = scoreDoc.doc;
        if (scores != null) scores[i] = scoreDoc.score;
      }

This is how TimeLimitingCollector collects results. Once timeAllowed has passed, collect() throws and the rest of the search is abandoned.

  /**
   * Calls {@link Collector#collect(int)} on the decorated {@link Collector}
   * unless the allowed time has passed, in which case it throws an exception.
   *
   * @throws TimeExceededException
   *           if the time allowed has exceeded.
   */
  @Override
  public void collect(final int doc) throws IOException {
    final long time = clock.get();
    if (timeout < time) {
      if (greedy) {
        //System.out.println(this+"  greedy: before failing, collecting doc: "+(docBase + doc)+"  "+(time-t0));
        collector.collect(doc);
      }
      //System.out.println(this+"  failing on:  "+(docBase + doc)+"  "+(time-t0));
      throw new TimeExceededException( timeout-t0, time-t0, docBase + doc );
    }
    //System.out.println(this+"  collecting: "+(docBase + doc)+"  "+(time-t0));
    collector.collect(doc);
  }
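To make the partial-results behaviour concrete, here is a small sketch of the same decorator idea with no Lucene dependency. Everything in it (TimeLimitSketch, TimeLimited, TimeExceeded, the fake tick-based clock) is a hypothetical stand-in written for this illustration; it models only the non-greedy case and sets the deadline at construction time rather than the way Lucene does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.function.LongSupplier;

// Illustrative only: a time-limited decorator around a doc-id consumer.
final class TimeLimitSketch {

    static final class TimeExceeded extends RuntimeException {
        TimeExceeded(String message) { super(message); }
    }

    /** Checks the clock on every call and refuses to collect once the deadline has passed. */
    static final class TimeLimited implements IntConsumer {
        private final IntConsumer delegate;
        private final LongSupplier clock;
        private final long deadline;

        TimeLimited(IntConsumer delegate, LongSupplier clock, long allowed) {
            this.delegate = delegate;
            this.clock = clock;
            this.deadline = clock.getAsLong() + allowed;
        }

        @Override
        public void accept(int doc) {
            if (deadline < clock.getAsLong()) {
                throw new TimeExceeded("time allowed exceeded at doc " + doc);
            }
            delegate.accept(doc);
        }
    }

    /** Feeds docs 0..numDocs-1 while a fake clock advances one tick per doc; returns what got through. */
    static List<Integer> run(int numDocs, long allowedTicks) {
        List<Integer> collected = new ArrayList<>();
        long[] now = {0};
        TimeLimited limited = new TimeLimited(doc -> collected.add(doc), () -> now[0], allowedTicks);
        try {
            for (int doc = 0; doc < numDocs; doc++) {
                limited.accept(doc);
                now[0]++;
            }
        } catch (TimeExceeded e) {
            // Same idea as qr.setPartialResults(true): keep what was collected so far.
        }
        return collected;
    }
}
```

The caller still gets the docs collected before the deadline, which is why a timed-out Solr query can return a partial result instead of nothing.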

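Next comes the Lucene search itself.

1. First, a Weight object is created for each query clause, and all of them are put into ArrayList<Weight> weights. This step determines the weight of each clause, which is used later during scoring.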

    public BooleanWeight(IndexSearcher searcher, boolean disableCoord)
      throws IOException {
      this.similarity = searcher.getSimilarity();
      this.disableCoord = disableCoord;
      weights = new ArrayList<>(clauses.size());
      for (int i = 0 ; i < clauses.size(); i++) {
        BooleanClause c = clauses.get(i);
        Weight w = c.getQuery().createWeight(searcher);
        weights.add(w);
        if (!c.isProhibited()) {
          maxCoord++;
        }
      }
    }

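2. Iterate over all segments, one after another, looking for doc ids that match the query. AtomicReaderContext carries the details of a segment, including its docBase and number of docs. This information is very useful when implementing query optimizations. Note that the collector here is the TopDocsCollector assigned in the code above.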

  /**
   * Lower-level search API.
   *
   * <p>
   * {@link Collector#collect(int)} is called for every document. <br>
   *
   * <p>
   * NOTE: this method executes the searches on all given leaves exclusively.
   * To search across all the searchers leaves use {@link #leafContexts}.
   *
   * @param leaves
   *          the searchers leaves to execute the searches on
   * @param weight
   *          to match documents
   * @param collector
   *          to receive hits
   * @throws BooleanQuery.TooManyClauses If a query would exceed
   *         {@link BooleanQuery#getMaxClauseCount()} clauses.
   */
  protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {

    // TODO: should we make this
    // threaded...?  the Collector could be sync'd?
    // always use single thread:
    for (AtomicReaderContext ctx : leaves) { // search each subreader
      try {
        collector.setNextReader(ctx);
      } catch (CollectionTerminatedException e) {
        // there is no doc of interest in this reader context
        // continue with the following leaf
        continue;
      }
      BulkScorer scorer = weight.bulkScorer(ctx, !collector.acceptsDocsOutOfOrder(), ctx.reader().getLiveDocs());
      if (scorer != null) {
        try {
          scorer.score(collector);
        } catch (CollectionTerminatedException e) {
          // collection was terminated prematurely
          // continue with the following leaf
        }
      }
    }
  }
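The role of docBase in this loop can be shown with a toy model. In the sketch below, Segment is a hypothetical stand-in for AtomicReaderContext that holds only a docBase and the segment-local ids that match; it is not Lucene API. The point is that doc ids are local to a segment while scoring, and the collector adds docBase to get index-wide ids.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: per-segment iteration with local-to-global doc id mapping.
final class SegmentDocBaseSketch {

    /** Hypothetical stand-in for a segment: its docBase and the local doc ids that match. */
    static final class Segment {
        final int docBase;
        final int[] matchingLocalDocs;

        Segment(int docBase, int[] matchingLocalDocs) {
            this.docBase = docBase;
            this.matchingLocalDocs = matchingLocalDocs;
        }
    }

    /** Visits the segments in order and returns the matching docs as index-wide ids. */
    static List<Integer> search(List<Segment> leaves) {
        List<Integer> globalIds = new ArrayList<>();
        for (Segment leaf : leaves) {
            int docBase = leaf.docBase;          // what setNextReader(ctx) tells the collector
            for (int localDoc : leaf.matchingLocalDocs) {
                globalIds.add(docBase + localDoc);
            }
        }
        return globalIds;
    }
}
```

For two segments with docBase 0 and 100, local hits {1, 3} and {0, 2} become the global ids 1, 3, 100 and 102.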

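3. Weight.bulkScorer creates the scorer for the query clauses. Lucene's optimization for multi-clause queries is well done: it orders the clauses by term frequency, with the rarest first and the most frequent last, which speeds up multi-clause queries considerably. The details of this optimization are covered in the next article.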
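Why ordering by frequency helps can be seen in a simplified intersection over sorted doc-id arrays. This is not Lucene's implementation, only a sketch of the general idea under the assumption that each clause is represented by a sorted int[] of matching doc ids; ConjunctionOrderSketch and intersect are names invented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: AND several sorted posting lists, driving from the rarest one.
final class ConjunctionOrderSketch {

    /** Returns the doc ids present in every list; each input array must be sorted ascending. */
    static List<Integer> intersect(List<int[]> postings) {
        List<int[]> byLength = new ArrayList<>(postings);
        byLength.sort(Comparator.comparingInt((int[] p) -> p.length));  // rarest clause first

        List<Integer> result = new ArrayList<>();
        if (byLength.isEmpty()) {
            return result;
        }
        // Only candidates from the shortest list are ever checked against the others.
        for (int doc : byLength.get(0)) {
            boolean inAll = true;
            for (int i = 1; i < byLength.size() && inAll; i++) {
                inAll = Arrays.binarySearch(byLength.get(i), doc) >= 0;
            }
            if (inAll) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

The number of candidates is bounded by the rarest clause, so a clause matching 3 documents combined with one matching millions costs at most 3 lookups in the large list.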

4. Finally, Lucene runs the actual search in scorer.score(collector). Two functions in Weight (its default bulk scorer) show how Lucene finds and counts the matches.

    @Override
    public boolean score(Collector collector, int max) throws IOException {
      // TODO: this may be sort of weird, when we are
      // embedded in a BooleanScorer, because we are
      // called for every chunk of 2048 documents.  But,
      // then, scorer is a FakeScorer in that case, so any
      // Collector doing something "interesting" in
      // setScorer will be forced to use BS2 anyways:
      collector.setScorer(scorer);
      if (max == DocIdSetIterator.NO_MORE_DOCS) {
        scoreAll(collector, scorer);
        return false;
      } else {
        int doc = scorer.docID();
        if (doc < 0) {
          doc = scorer.nextDoc();
        }
        return scoreRange(collector, scorer, doc, max);
      }
    }

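Lucene keeps pulling matching docs from the segment and hands them to the collector, which puts them into its HitQueue. Note that the collector parameter is typed as Collector, the parent class of TopDocsCollector and others, so scoreAll can feed doc ids not only to a TopDocsCollector but also to the collectors used by other kinds of queries.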

    static void scoreAll(Collector collector, Scorer scorer) throws IOException {
      int doc;
      while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        collector.collect(doc);
      }
    }

Stepping into collector.collect(doc) shows how TopDocsCollector records doc ids, which is exactly as described earlier.

    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }
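The counting and top-N logic can be tried out without Lucene. The sketch below uses java.util.PriorityQueue in place of HitQueue; TopNCollectorSketch and Hit are hypothetical names. One deliberate difference: Lucene's queue is pre-filled with sentinel entries so that pqTop always exists, whereas this sketch fills the queue lazily, which is simpler to read but not how the original does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: count every hit, keep the n best by score.
final class TopNCollectorSketch {

    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    private final PriorityQueue<Hit> pq;   // head = weakest hit currently kept
    private int totalHits;

    TopNCollectorSketch(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        this.n = n;
        // Weakest first: lower score loses; on equal scores the higher doc id loses.
        this.pq = new PriorityQueue<Hit>((a, b) -> {
            int c = Float.compare(a.score, b.score);
            return c != 0 ? c : Integer.compare(b.doc, a.doc);
        });
    }

    /** Docs are assumed to arrive in increasing doc id order, as in the in-order collector. */
    void collect(int doc, float score) {
        totalHits++;                        // every match counts, kept or not
        if (pq.size() < n) {
            pq.add(new Hit(doc, score));
            return;
        }
        if (score <= pq.peek().score) {
            return;                         // ties lose: the earlier doc id is preferred
        }
        pq.poll();
        pq.add(new Hit(doc, score));
    }

    int getTotalHits() { return totalHits; }

    /** Kept doc ids, best score first, lower doc id first on ties. */
    List<Integer> topDocs() {
        List<Hit> hits = new ArrayList<>(pq);
        hits.sort((a, b) -> {
            int c = Float.compare(b.score, a.score);
            return c != 0 ? c : Integer.compare(a.doc, b.doc);
        });
        List<Integer> ids = new ArrayList<>();
        for (Hit h : hits) ids.add(h.doc);
        return ids;
    }
}
```

Note that totalHits grows on every call while the queue never holds more than n entries, which is why the reported hit count can be far larger than the number of rows returned.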

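Summary: this article walked through the non-sorted query path in detail, touching the classes QueryComponent, SolrIndexSearcher, TimeLimitingCollector, TopDocsCollector, IndexSearcher, BulkScorer, and Weight. For reasons of length it did not cover how doc ids are read from a segment or how multi-clause queries are implemented; both will be covered in detail in the next article, on multi-clause queries.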
