Lucene in Action: Table of Contents

Part 1 Core Lucene

Chapter 1 Meet Lucene

1.1 Dealing with information explosion

1.2 What is Lucene?

1.2.1 What Lucene can do

1.2.2 History of Lucene

1.3 Lucene and the components of a search application

1.3.1 Components for indexing

1.3.2 Components for searching

1.3.3 The rest of the search application

1.3.4 Where Lucene fits into your application

1.4 Lucene in action: a sample application

1.4.1 Creating an index

1.4.2 Searching an index

1.5 Understanding the core indexing classes

1.5.1 IndexWriter

1.5.2 Directory

1.5.3 Analyzer

1.5.4 Document

1.5.5 Field

1.6 Understanding the core searching classes

1.6.1 IndexSearcher

1.6.2 Term

1.6.3 Query

1.6.4 TermQuery

1.6.5 TopDocs

1.7 Summary

Chapter 2 Building a search index

2.1 How Lucene models content

2.1.1 Documents and fields

2.1.2 Flexible schema

2.1.3 Denormalization

2.2 Understanding the indexing process

2.2.1 Extracting text and creating the document

2.2.2 Analysis

2.2.3 Adding to the index

2.3 Basic index operations

2.3.1 Adding documents to an index

2.3.2 Deleting documents from an index

2.3.3 Updating documents in the index

2.4 Field options

2.4.1 Field options for indexing

2.4.2 Field options for storing fields

2.4.3 Field options for term vectors

2.4.4 Reader, TokenStream, and byte[] field values

2.4.5 Field option combinations

2.4.6 Field options for sorting

2.4.7 Multivalued fields

2.5 Boosting documents and fields

2.5.1 Boosting documents

2.5.2 Boosting fields

2.5.3 Norms

2.6 Indexing numbers, dates, and times

2.6.1 Indexing numbers

2.6.2 Indexing dates and times

2.7 Field truncation

2.8 Near-real-time search

2.9 Optimizing an index

2.10 Other directory implementations

2.11 Concurrency, thread safety, and locking issues

2.11.1 Thread and multi-JVM safety

2.11.2 Accessing an index over a remote file system

2.11.3 Index locking

2.12 Debugging indexing

2.13 Advanced indexing concepts

2.13.1 Deleting documents with IndexReader

2.13.2 Reclaiming disk space used by deleted documents

2.13.3 Buffering and flushing

2.13.4 Index commits

2.13.5 ACID transactions and index consistency

2.13.6 Merging segments

2.14 Summary

Chapter 3 Adding search to your application

3.1 Implementing a simple search feature

3.1.1 Searching for a specific term

3.1.2 Parsing a user-entered query expression: QueryParser

3.2 Using IndexSearcher

3.2.1 Creating an IndexSearcher

3.2.2 Performing searches

3.2.3 Working with TopDocs

3.2.4 Paging through results

3.2.5 Near-real-time search

3.3 Understanding Lucene scoring

3.3.1 How Lucene scores

3.3.2 Using explain() to understand hit scoring

3.4 Lucene's diverse queries

3.4.1 Searching by term: TermQuery

3.4.2 Searching within a term range: TermRangeQuery

3.4.3 Searching within a numeric range: NumericRangeQuery

3.4.4 Searching on a string: PrefixQuery

3.4.5 Combining queries: BooleanQuery

3.4.6 Searching by phrase: PhraseQuery

3.4.7 Searching by wildcard: WildcardQuery

3.4.8 Searching for similar terms: FuzzyQuery

3.4.9 Matching all documents: MatchAllDocsQuery

3.5 Parsing query expressions: QueryParser

3.5.1 Query.toString

3.5.2 TermQuery

3.5.3 Term range searches

3.5.4 Numeric and date range searches

3.5.5 Prefix and wildcard queries

3.5.6 Boolean operators

3.5.7 Phrase queries

3.5.8 Fuzzy queries

3.5.9 MatchAllDocsQuery

3.5.10 Grouping

3.5.11 Field selection

3.5.12 Setting the boost for a subquery

3.5.13 To QueryParse or not to QueryParse?

3.6 Summary

Chapter 4 Lucene's analysis process

4.1 Using analyzers

4.1.1 Indexing analysis

4.1.2 QueryParser analysis

4.1.3 Parsing vs. analysis: when an analyzer isn't appropriate

4.2 What's inside an analyzer?

4.2.1 What's in a token?

4.2.2 TokenStream uncensored

4.2.3 Visualizing analyzers

4.2.4 Token filters: the importance of filter order

4.3 Using the built-in analyzers

4.3.1 StopAnalyzer

4.3.2 StandardAnalyzer

4.3.3 Which core analyzer should you use?

4.4 Sounds-like querying

4.5 Synonyms, aliases, and words that mean the same

4.5.1 Creating SynonymAnalyzer

4.5.2 Visualizing token positions

4.6 Stemming analysis

4.6.1 StopFilter leaves holes

4.6.2 Combining stemming and stop-word removal

4.7 Field variations

4.7.1 Analysis of multivalued fields

4.7.2 Field-specific analysis

4.7.3 Searching on unanalyzed fields

4.8 Language analysis issues

4.8.1 Unicode and character encodings

4.8.2 Analyzing non-English languages

4.8.3 Character normalization

4.8.4 Analyzing Asian languages

4.8.5 Other issues in non-English language analysis

4.9 Nutch analysis

4.10 Summary

Chapter 5 Advanced search techniques

5.1 Lucene's field cache

5.1.1 Loading field values for all documents

5.1.2 Per-segment readers

5.2 Sorting search results

5.2.1 Sorting search results by field value

5.2.2 Sorting by relevance

5.2.3 Sorting by index order

5.2.4 Sorting by a field

5.2.5 Reversing sort order

5.2.6 Sorting by multiple fields

5.2.7 Selecting a sorting field type

5.2.8 Using a nondefault locale for sorting

5.3 Using MultiPhraseQuery

5.4 Querying on multiple fields at once

5.5 Span queries

5.5.1 Building block of spanning: SpanTermQuery

5.5.2 Finding spans at the beginning of a field

5.5.3 Spans near one another

5.5.4 Excluding span overlap from matches

5.5.5 SpanOrQuery

5.5.6 SpanQuery and QueryParser

5.6 Filtering a search

5.6.1 TermRangeFilter

5.6.2 NumericRangeFilter

5.6.3 FieldCacheRangeFilter

5.6.4 Filtering by specific terms

5.6.5 Using QueryWrapperFilter

5.6.6 Using SpanQueryFilter

5.6.7 Security filters

5.6.8 Using BooleanQuery for filtering

5.6.9 PrefixFilter

5.6.10 Caching filter results

5.6.11 Wrapping a filter as a query

5.6.12 Filtering a filter

5.6.13 Beyond the built-in filters

5.7 Custom scoring using function queries

5.7.1 Function query classes

5.7.2 Boosting recently modified documents using function queries

5.8 Searching across multiple Lucene indexes

5.8.1 Using MultiSearcher

5.8.2 Multithreaded searching using ParallelMultiSearcher

5.9 Leveraging term vectors

5.9.1 Books like this

5.9.2 What category?

5.9.3 TermVectorMapper

5.10 Loading fields with FieldSelector

5.11 Stopping a slow search

5.12 Summary

Chapter 6 Extending search

6.1 Using a custom sort method

6.1.1 Indexing documents for geographic sorting

6.1.2 Implementing custom geographic sort

6.1.3 Accessing values used in custom sorting

6.2 Developing a custom Collector

6.2.1 The Collector base class

6.2.2 Custom collector: BookLinkCollector

6.2.3 AllDocCollector

6.3 Extending QueryParser

6.3.1 Customizing QueryParser's behavior

6.3.2 Prohibiting fuzzy and wildcard queries

6.3.3 Handling numeric field-range queries

6.3.4 Handling date ranges

6.3.5 Allowing ordered phrase queries

6.4 Custom filters

6.4.1 Implementing a custom filter

6.4.2 Using our custom filter during searching

6.4.3 An alternative: FilteredQuery

6.5 Payloads

6.5.1 Producing payloads during analysis

6.5.2 Using payloads during searching

6.5.3 Payloads and SpanQuery

6.5.4 Retrieving payloads via TermPositions

6.6 Summary

Part 2 Applied Lucene

Chapter 7 Extracting text with Tika

7.1 What is Tika?

7.2 Tika's logical design and API

7.3 Installing Tika

7.4 Tika's built-in text extraction tool

7.5 Extracting text programmatically

7.5.1 Indexing a Lucene document

7.5.2 The Tika utility class

7.5.3 Customizing parser selection

7.6 Tika's limitations

7.7 Indexing custom XML

7.7.1 Parsing using SAX

7.7.2 Parsing and indexing using Apache Commons Digester

7.8 Alternatives

7.9 Summary

Chapter 8 Essential Lucene extensions

8.1 Luke, the Lucene Index Toolbox

8.1.1 Overview tab: seeing the big picture

8.1.2 Document browsing

8.1.3 Using QueryParser to search

8.1.4 Files and plugins view

8.2 Analyzers, tokenizers, and TokenFilters

8.2.1 SnowballAnalyzer

8.2.2 Ngram filters

8.2.3 Shingle filters

8.2.4 Obtaining the contrib analyzers

8.3 Highlighting query terms

8.3.1 Highlighter components

8.3.2 Standalone highlighter example

8.3.3 Highlighting with CSS

8.3.4 Highlighting search results

8.4 FastVectorHighlighter

8.5 Spell checking

8.5.1 Generating a suggestions list

8.5.2 Selecting the best suggestion

8.5.3 Presenting the result to the user

8.5.4 Some ideas to improve spell checking

8.6 Fun and interesting Query extensions

8.6.1 MoreLikeThis

8.6.2 FuzzyLikeThisQuery

8.6.3 BoostingQuery

8.6.4 TermsFilter

8.6.5 DuplicateFilter

8.6.6 RegexQuery

8.7 Building contrib modules

8.7.1 Getting the sources

8.7.2 Ant in the contrib directory

8.8 Summary

Chapter 9 Further Lucene extensions

9.1 Chaining filters

9.2 Storing an index in Berkeley DB

9.3 Synonyms from WordNet

9.3.1 Building the synonym index

9.3.2 Tying WordNet synonyms into an analyzer

9.4 Fast memory-based indices

9.5 XML QueryParser: beyond "one box" search interfaces

9.5.1 Using XmlQueryParser

9.5.2 Extending the XML query syntax

9.6 Surround query language

9.7 Spatial Lucene

9.7.1 Indexing spatial data

9.7.2 Searching spatial data

9.7.3 Performance characteristics of Spatial Lucene

9.8 Searching multiple indexes remotely

9.9 Flexible QueryParser

9.10 Odds and ends

9.11 Summary

Chapter 10 Using Lucene from other programming languages

10.1 Ports primer

10.1.1 Trade-offs

10.1.2 Choosing the right port

10.2 CLucene (C++)

10.2.1 Motivation

10.2.2 API and index compatibility

10.2.3 Supported platforms

10.2.4 Current and future work

10.3 Lucene.Net (C# and other .NET languages)

10.3.1 API compatibility

10.3.2 Index compatibility

10.4 KinoSearch and Lucy (Perl)

10.4.1 KinoSearch

10.4.2 Lucy

10.4.3 Other Perl options

10.5 Ferret (Ruby)

10.6 PHP

10.6.1 Zend Framework

10.6.2 PHP Bridge

10.7 PyLucene (Python)

10.7.1 API compatibility

10.7.2 Other Python options

10.8 Solr (many programming languages)

10.9 Summary

Chapter 11 Lucene administration and performance tuning

11.1 Performance tuning

11.1.1 Simple performance-tuning steps

11.1.2 Testing approach

11.1.3 Tuning for index-to-search delay

11.1.4 Tuning for indexing throughput

11.1.5 Tuning for search latency and throughput

11.2 Threads and concurrency

11.2.1 Using threads for indexing

11.2.2 Using threads for searching

11.3 Managing resource consumption

11.3.1 Disk space

11.3.2 File descriptors

11.3.3 Memory

11.4 Hot backups of the index

11.4.1 Creating the backup

11.4.2 Restoring the index

11.5 Common errors

11.5.1 Index corruption

11.5.2 Repairing an index

11.6 Summary

Part 3 Case studies

Chapter 12 Case study 1: Krugle

12.1 Introducing Krugle

12.2 Appliance architecture

12.3 Search performance

12.4 Parsing source code

12.5 Substring searching

12.6 Query vs. search

12.7 Future improvements

12.7.1 FieldCache memory usage

12.7.2 Combining indexes

12.8 Summary

Chapter 13 Case study 2: SIREn

13.1 Introducing SIREn

13.2 SIREn's benefits

13.2.1 Searching across all fields

13.2.2 A single efficient lexicon

13.2.3 Flexible fields

13.2.4 Efficient handling of multivalued fields

13.3 Indexing entities with SIREn

13.3.1 Data model

13.3.2 Implementation issues

13.3.3 Index schema

13.3.4 Data preparation before indexing

13.4 Searching entities with SIREn

13.4.1 Searching content

13.4.2 Restricting search within a cell

13.4.3 Combining cells into tuples

13.4.4 Querying an entity description

13.5 Integrating SIREn in Solr

13.6 Benchmark

13.7 Summary

Chapter 14 Case study 3: LinkedIn

14.1 Faceted search with Bobo Browse

14.1.1 Bobo Browse design

14.1.2 Beyond simple faceting

14.2 Real-time search with Zoie

14.2.1 Zoie architecture

14.2.2 Real-time vs. near-real-time

14.2.3 Documents and indexing requests

14.2.4 Custom IndexReaders

14.2.5 Comparison with Lucene near-real-time search

14.2.6 Distributed search

14.3 Summary

Appendix A Installing Lucene

A.1 Binary installation

A.2 Running the command-line demo

A.3 Running the web application demo

A.4 Building from source

A.5 Troubleshooting

Appendix B Lucene index format

B.1 Logical index view

B.2 About index structure

B.2.1 Understanding the multifile index structure

B.2.2 Understanding the compound index structure

B.2.3 Converting from one index structure to the other

B.3 Inverted index

B.4 Summary

Appendix C Lucene/contrib benchmark

C.1 Running an algorithm

C.2 Parts of an algorithm file

C.2.1 Content source and document maker

C.2.2 Query maker

C.3 Control structures

C.4 Built-in tasks

C.4.1 Creating and using line files

C.4.2 Built-in reporting tasks

C.5 Evaluating search quality

C.6 Errors

C.7 Summary

Appendix D Resources

D.1 Lucene knowledge base

D.2 Internationalization

D.3 Language detection

D.4 Term vectors

D.5 Lucene ports

D.6 Case studies

D.7 Miscellaneous

D.8 Information retrieval software

D.9 Doug Cutting's publications

D.9.1 Conference papers

D.9.2 U.S. patents