Solr enterprise search application server


1. Introduction to Solr

Search is something almost all of us use every day: Taobao, Jingdong, Baidu and the other large websites you visit are built around it.

Take a look at Baidu's search page.
Take a look at Jingdong's.
Many people have the same question: why are their searches so fast, and what technology makes that possible?

In ordinary projects the search feature is very common: a fuzzy query against one column of a database table. With several columns you have to query each of them, which quickly becomes tedious. On an e-commerce platform, where order and product volumes are huge, every search puts real pressure on the database; add some concurrency and you are lucky if the site does not fall over.

That is why search is moved out to a dedicated external search server. So what is Solr?

Solr is a high-performance full-text search server based on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface, making it an excellent full-text search engine.
Lucene
Lucene is a sub-project of the Apache Jakarta project: an open-source full-text retrieval toolkit. It is not a complete full-text search engine but a full-text search architecture, providing a complete query engine and index engine and part of a text-analysis engine. Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.

2. Inverted index

Introduction:
An inverted index is keyed by words (terms). The entry for each term records every document that contains it: each entry is a posting list holding the document IDs and the positions where the term occurs in those documents.

For example:
Normally we find a document first, and then look at the words it contains (a forward index);

an inverted index works the other way around, using a word to find the documents in which it appears.

A concrete example

Document ID    Document content
1              Full-text search engine toolkit
2              The architecture of a full-text search engine
3              Query engine and index engine

Segmentation result

Document ID    Segmentation result set
1              {full-text, search, engine, tool, package}
2              {full-text, search, engine, of, architecture}
3              {query, engine, and, index, engine}

Inverted index

No.    Term            Document IDs
1      full-text       1, 2
2      search          1, 2
3      engine          1, 2, 3
4      tool            1
5      package         1
6      architecture    2
7      query           3
8      index           3

Explanation:
The set of documents for each term changes dynamically, so building and maintaining an inverted index is relatively complex. At query time, however, it is more efficient than a forward index, because all documents containing the query keywords can be obtained in a single lookup. In full-text retrieval, fast query response is the most critical performance requirement; index building runs in the background, and although it is comparatively slow, it does not affect the efficiency of the search itself.
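
To make this concrete, here is a minimal sketch (illustration only, not Solr or Lucene code) that builds an inverted index for the three example documents above: each term maps to the set of document IDs that contain it, so a query is a single map lookup.

import java.util.*;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        //Documents keyed by ID, already segmented into terms
        Map<Integer, List<String>> docs = new LinkedHashMap<>();
        docs.put(1, Arrays.asList("full-text", "search", "engine", "tool", "package"));
        docs.put(2, Arrays.asList("full-text", "search", "engine", "architecture"));
        docs.put(3, Arrays.asList("query", "engine", "index", "engine"));

        //Inverted index: term -> sorted set of document IDs that contain it
        Map<String, SortedSet<Integer>> inverted = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> e : docs.entrySet()) {
            for (String term : e.getValue()) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
            }
        }

        //One lookup returns every document containing the term
        System.out.println("engine -> " + inverted.get("engine")); //[1, 2, 3]
    }
}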

3. Introduction to Lucene API

Create a Maven project.

Add the following dependencies to pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>8.0.0</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>8.0.0</version>
        </dependency>
    </dependencies>

Create a test class and add the following code:

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

import java.io.File;

public class Test1 {
    String[] a = {
            "3, Huawei - Huawei computer, Hot money",
            "4, Huawei Mobile, flagship",
            "5, association - Thinkpad, Business book",
            "6, Lenovo Mobile, Self portrait artifact"
    };

    @Test
    public void test1() throws Exception {
        //Path to store index file
        File path = new File("d:/abc/");
        FSDirectory d = FSDirectory.open(path.toPath());
        //Chinese analyzer (word segmentation) provided by lucene
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        //Configuration object that specifies the analyzer
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
        //IndexWriter writes documents into the index on disk
        IndexWriter writer = new IndexWriter(d, cfg);

        for (int i = 0; i < a.length; i++) {
            String[] strs = a[i].split(",");

            //Create a document that contains the fields to index
            Document doc = new Document();
            doc.add(new LongPoint("id", Long.parseLong(strs[0])));
            doc.add(new StoredField("id", Long.parseLong(strs[0])));
            doc.add(new TextField("title", strs[1], Field.Store.YES));
            doc.add(new TextField("sellPoint", strs[2], Field.Store.YES));

            //Write document to disk index file
            writer.addDocument(doc);
        }
        writer.close();
    }
}

Run the test to create the index files
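
If you want to see how SmartChineseAnalyzer splits a piece of text before it is indexed, you can print the tokens directly. A minimal, self-contained sketch (for illustration only, it is not part of the test class above):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new SmartChineseAnalyzer();
             TokenStream ts = analyzer.tokenStream("title", "Huawei Mobile, flagship")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                //Print each token on its own line
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}

Running this against the sample strings is also a quick way to check exactly which terms end up in the index (for example, whether Latin text is lowercased) before writing TermQuery objects by hand later.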

View index
luke is an index viewing tool. Download: https://github.com/DmitryKey/luke/releases

Note that the luke version should match the Lucene version you are using.
Alternatively, download the source code and build it with Maven.
Source address: https://github.com/lmy1965673628/luke.git

Find the jar package, run it, and open the directory where the index is stored


Specify the analyzer and test tokenization
Query test
Enter a title keyword to query; all matching documents are listed below
Querying by id
Switch the id field to the long type, because that is how it is stored by the program
Change the condition to id=5 and look at the result
Querying the index from code

Add a test2() method to the test class (it also needs these imports: org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.Term, org.apache.lucene.search.IndexSearcher, org.apache.lucene.search.TermQuery, org.apache.lucene.search.TopDocs and org.apache.lucene.search.ScoreDoc)

    @Test
    public void test2() throws Exception {
        //Storage directory of index data
        File path = new File("d:/abc");
        FSDirectory d = FSDirectory.open(path.toPath());
        //Create a search tool object
        DirectoryReader reader = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);

        //Term query on the title field; smartcn lowercases Latin text, so the indexed term is "huawei"
        TermQuery q = new TermQuery(new Term("title", "huawei"));
        //Execute the query and return the first 20 pieces of data
        TopDocs docs = searcher.search(q, 20);

        //Traverse the queried result document and display
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            System.out.println(doc.get("id"));
            System.out.println(doc.get("title"));
            System.out.println(doc.get("sellPoint"));
            System.out.println("--------------");
        }
    }

Run it and compare the output with the result of the same query in luke
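
test2() searches the title field with a TermQuery. The id field, however, was indexed as a LongPoint, so querying by id from code needs a point query instead. A minimal sketch along the same lines (this test3() is illustrative and not part of the original class; it also needs org.apache.lucene.document.LongPoint and org.apache.lucene.search.Query on the import list):

    @Test
    public void test3() throws Exception {
        File path = new File("d:/abc");
        FSDirectory d = FSDirectory.open(path.toPath());
        DirectoryReader reader = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);

        //Exact-match query on the numeric id field (indexed as a LongPoint)
        Query q = LongPoint.newExactQuery("id", 5L);
        TopDocs docs = searcher.search(q, 20);
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            //The printable id comes from the StoredField added during indexing
            Document doc = searcher.doc(scoreDoc.doc);
            System.out.println(doc.get("id") + " " + doc.get("title") + " " + doc.get("sellPoint"));
        }
        reader.close();
    }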

4. solr installation

Download address: http://archive.apache.org/dist/lucene/solr/8.0.0/
Solr will run on Linux here, so download the Linux version

Transfer the file to the /home directory
Unzip solr

cd /home
# Upload solr-8.0.0.tgz to the /home directory
# and unpack it
tar -xzf solr-8.0.0.tgz

Start solr

cd /home/solr-8.0.0
# Running Solr as root is not recommended; the -force flag is required to allow it
bin/solr start -force
#If firewall is on
# Open 8983 port
firewall-cmd --zone=public --add-port=8983/tcp --permanent
firewall-cmd --reload

Open the Solr console in a browser

http://<server-ip>:8983

Create a core
Take a look at my database

The item data is in the pd_item table of the database. The index data is stored in Solr, and a core is created in Solr to hold that index data

To create a core named pd, first prepare the following directory structure:

# solr directory/server/solr/
#                    pd/
#                      conf/
#                      data/


cd /home/solr-8.0.0

mkdir server/solr/pd
mkdir server/solr/pd/conf
mkdir server/solr/pd/data

The conf directory holds the core's configuration files. We start from the default configuration and modify it step by step
Copy the default configuration

cd /home/solr-8.0.0

cp -r server/solr/configsets/_default/conf server/solr/pd

Create a core named pd
Chinese word segmentation test
Fill in the following text and observe the segmentation results:

Solr is a high-performance full-text search server developed with Java 5. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface, making it an excellent full-text search engine.

Chinese word segmentation tool - IK analyzer

https://github.com/magese/ik-analyzer-solr

Download the source code; you need to build the jar yourself and copy it to <solr directory>/server/solr-webapp/webapp/WEB-INF/lib

Remember to match the version to your Solr version before packaging

Alternatively, download the files I have already prepared
Link: https://pan.baidu.com/s/1xSfUi9C5LpN98aUQL8eI6A
Extraction code: d7k4

Copy all the jar packages to <solr directory>/server/solr-webapp/webapp/WEB-INF/lib

Copy the other files to <solr directory>/server/solr-webapp/webapp/WEB-INF/classes
If there is no classes folder, create one

Configure managed-schema
Edit <solr directory>/server/solr/pd/conf/managed-schema and add the IK Analyzer field type at the end of the file

<!-- ik Tokenizer  -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Restart solr service

cd /home/solr-8.0.0

bin/solr restart -force

Test Chinese word segmentation with IK Analyzer
Fill in the following text, choose the text_ik analyzer, and observe the segmentation results:

Solr is a high-performance full-text search server developed with Java 5. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface, making it an excellent full-text search engine.

Set stop words

Upload the stop word configuration files to <solr directory>/server/solr-webapp/webapp/WEB-INF/classes

stopword.dic
stopwords.txt

These files are already prepared and can be extended as needed
Restart the service and observe that stop words are now ignored in the segmentation results

bin/solr restart -force

Some time ago a few of the company's information columns ran into politically sensitive words, and product, development, data and other teams were quickly pulled together to find the affected content. How is that usually done? Go to the database, quickly write SQL statements to filter, check the hits one by one, and then delete them

Baidu, by contrast, can block such content almost instantly. That speed comes from its search system, and it shows that the database approach has no advantage at all
Once sensitive words are published, they can be located and blocked quickly, which greatly improves response time

5. Connect mysql

Grant root remote (cross-network) access
Note: this configures the root user for remote login; the password of the locally logged-in root user is unchanged

grant all on *.* to 'root'@'%' identified by 'root';

On MySQL 8 and later this command reports an error, so use this one instead:

grant all on *.* to 'root'@'%';

Refresh privileges (takes effect immediately)

flush privileges;


You can use the database I provide: GitHub address
Import it locally; we use the pd_item table for testing

Randomly take about 30% of the products off the shelves for later query tests

UPDATE pd_item SET STATUS=0 WHERE RAND()<0.3

Import product data from mysql

Set field

  • title text_ik
  • sellPoint text_ik
  • price plong
  • barcode string
  • image string
  • cid plong
  • status pint
  • created pdate
  • updated pdate

Copy Field (copy field)

When querying you normally specify a field, such as title:computer. To search several fields at once, their values can be merged into a single field. The default query field is _text_, so copy title and sellPoint into the _text_ field


Add jar files
The Data Import Handler jar files are in the <solr directory>/dist directory.
Copy solr-dataimporthandler-8.0.0.jar and solr-dataimporthandler-extras-8.0.0.jar to /home/solr-8.0.0/server/solr-webapp/webapp/WEB-INF/lib

This is the same kind of copy step we did when configuring the IK Analyzer jars

Next, go to /home/solr-8.0.0/server/solr/pd/conf to modify the configuration


Step 1: add the following configuration at the end of solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

Step 2: create a data-config.xml file to configure the database connection; change it to your own database's details following this format
Pay attention to the MySQL version:

  • For MySQL 8.0 and later the driver class should be com.mysql.cj.jdbc.Driver
  • Also note the URL parameters serverTimezone (time zone) and allowPublicKeyRetrieval (public key retrieval); both must be present exactly as written, or the connection will report an error
  • Inside the XML file, every & in the URL must be written as &amp;
<dataConfig>
    <!-- database information  -->
    <dataSource type="JdbcDataSource" 
        driver="com.mysql.cj.jdbc.Driver" 
        url="jdbc:mysql://172.16.2.134:3306/pd_store?userSSL=true&amp;useUnicode=true&amp;characterEncoding=UTF8&amp;useSSL=false&amp;serverTimezone=GMT%2B8&amp;allowPublicKeyRetrieval=true" 
        user="root" password="root"/>
    <document>
        <!-- document entity -->
        <entity name="item" query="SELECT * FROM pd_item">
            <!-- Database field mapping solr field -->
            <field column="title" name="title"/>
            <field column="sell_point" name="sellPoint"/>
            <field column="price" name="price"/>
            <field column="barcode" name="barcode"/>
            <field column="image" name="image"/>
            <field column="cid" name="cid"/>
            <field column="status" name="status"/>
            <field column="created" name="created"/>
            <field column="updated" name="updated"/>
        </entity>
    </document>
</dataConfig>

Restart solr

cd /home/solr-8.0.0
bin/solr restart -force

Refresh the console and click Dataimport; the relevant information now appears on the right-hand side

Then import the data; you can enable auto-refresh to watch the import progress

If the configuration is correct, the panel on the right keeps updating with the data transfer rate and the total number of imported documents. If not, check the configuration information again

Query test

Search for computer in the title field
The result shows 10 documents. You may wonder whether that is right
Check it with Navicat: the table really does contain 10 matching rows

Double quotation marks search for the exact phrase "notebook". Checking the results carefully, a document matches as long as either its title or its sellPoint contains notebook
Searching +Lenovo +computer returns the documents that contain both the Lenovo and computer keywords

Searching +Lenovo -computer returns the documents that contain Lenovo but not computer
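
All of these are just query strings for Solr's standard query parser, so the same syntax can be sent from code as well. A small SolrJ sketch (illustrative; the host and core are the ones used later in application.yml, and the client setup is covered in the practice section below):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QuerySyntaxDemo {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://192.168.64.170:8983/solr/pd").build();

        String[] queries = {
                "title:computer",     //search one field
                "\"notebook\"",       //exact phrase against the default _text_ field
                "+Lenovo +computer",  //must contain both keywords
                "+Lenovo -computer"   //contains Lenovo but not computer
        };
        for (String q : queries) {
            QueryResponse rsp = client.query(new SolrQuery(q));
            System.out.println(q + " -> " + rsp.getResults().getNumFound() + " hits");
        }
        client.close();
    }
}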

Statistics on cid count how many documents exist for each value of the cid field.
Change the query parameter q to *:* and enable faceting on the cid field.


At the end of the response you can see the counts in the form "cid": [value, count, value, count, ...]

You can also add conditions, for example only keep cid values whose document count is at least 50, by adding the parameter facet.mincount=50

In the results, every remaining cid value now has a document count of at least 50

Price range
Fill in the Raw Query Parameters input box with the following:

facet.range=price&facet.range.start=0&facet.range.end=10000&facet.range.gap=2000

This means the price range runs from 0 to 10000 in steps of 2000, and the number of documents in each price bucket is counted

Take a look at the results

Multi-field statistics
Fill in the Raw Query Parameters input box with the following:

facet.pivot=cid,status


Take a look at the results: the counts are broken down in detail, first by cid and then by status within each cid
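
For reference, the same statistics can also be requested from SolrJ. A sketch (illustrative; the facet helper methods are part of the SolrQuery API, and the host is the one used later in application.yml):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetDemo {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://192.168.64.170:8983/solr/pd").build();

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);                                     //only the statistics, not the documents
        query.setFacet(true);
        query.addFacetField("cid");                           //count documents per cid value
        query.setFacetMinCount(50);                           //keep only values with at least 50 documents
        query.addNumericRangeFacet("price", 0, 10000, 2000);  //price buckets of 2000 from 0 to 10000
        query.addFacetPivotField("cid,status");               //two-level statistics: cid, then status

        QueryResponse rsp = client.query(query);
        System.out.println(rsp.getFacetFields());
        System.out.println(rsp.getFacetRanges());
        System.out.println(rsp.getFacetPivot());
        client.close();
    }
}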

6. Practice

Take a look at my project: GitHub address

Run it and view the home page
It looks much like a typical e-commerce mall page, and it also has a search function, which we will implement with Solr
Product search call analysis


Add the Solr and Lombok dependencies to pom.xml

<dependency>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-starter-data-solr</artifactId>
</dependency>

<dependency>
	<groupId>org.projectlombok</groupId>
	<artifactId>lombok</artifactId>
</dependency>

Add the Solr connection information to application.yml

spring:
  data:
    solr:   #Pay attention to ip address modification
      host: http://192.168.64.170:8983/solr/pd

Item entity class

@Data
public class Item implements Serializable {
		private static final long serialVersionUID = 1L;
		
		@Field("id")
		private String id;
		@Field("title")
		private String title;
		@Field("sellPoint")
		private String sellPoint;
		@Field("price")
		private Long price;
		@Field("image")
		private String image;

}

SearchService business interface

public interface SearchService {
	List<Item> findItemByKey(String key) throws Exception;
}

SearchServiceImpl business implementation class

@Service
public class SearchServiceImpl implements SearchService {
	
	/*
	 * The SolrClient instance is created in the SolrAutoConfiguration class
	 * 
	 * SolrAutoConfiguration carries the @Configuration annotation,
	 * i.e. it is a Spring Boot auto-configuration class, and its solrClient() method creates the SolrClient instance
	 */
	@Autowired
	private SolrClient solrClient;

	@Override
	public List<Item> findItemByKey(String key) throws Exception {
		//Encapsulate the query keyword
		//Other query parameters can also be set here, such as specific fields, facet settings, etc
		SolrQuery query = new SolrQuery(key);
		//Paging: return the first 20 documents
		query.setStart(0);
		query.setRows(20);
		
		//Execute the query and get the query result
		QueryResponse qr = solrClient.query(query);
		//Turn query results into a group of commodity instances
		List<Item> beans = qr.getBeans(Item.class);
		return beans;
	}
}

SearchController controller

@Controller
public class SearchController {
	@Autowired
	private SearchService searchService;
	
	@GetMapping("/search/toSearch.html")
	public String search(String key, Model model) throws Exception {
		List<Item> itemList = searchService.findItemByKey(key);
		model.addAttribute("list", itemList);
		return "/search.jsp";
	}
}


What if you want to achieve the same effect as in the console?
For example, the query title:computer, which matched 10 documents in the test above
How should the code be written?

	@Override
	public List<Item> findItemByKey(String key) throws Exception {
		//Key words to encapsulate query
		//It can also encapsulate other query parameters, such as specified fields, facet settings, etc
//		SolrQuery query = new SolrQuery(key);
		SolrQuery query = new SolrQuery();
		//Paging: return the first 20 documents
		query.setStart(0);
		query.setRows(20);
		query.setQuery("title:"+key);
		//Execute the query and get the query result
		QueryResponse qr = solrClient.query(query);
		//Turn query results into a group of commodity instances
		List<Item> beans = qr.getBeans(Item.class);
		return beans;
	}

Take a look at the results: 10 documents in total, the same as the console query.
As you can see, different kinds of search results only require splicing different query conditions, so dedicated interfaces can be written in advance for each requirement to meet the actual needs.
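
A hypothetical example of such an interface (not from the article's repository): a method that searches a caller-chosen field and optionally filters by status. The keyword is escaped with ClientUtils.escapeQueryChars (org.apache.solr.client.solrj.util.ClientUtils) so that query syntax characters in user input are treated literally; the assumption that status=1 means "still on sale" follows from the earlier UPDATE that set status=0 for off-shelf items.

	//Hypothetical variant of findItemByKey: search one field and optionally filter by status
	public List<Item> findItemByField(String field, String key, boolean onlyOnSale) throws Exception {
		//Escape the user-supplied keyword so +, -, ", : and other syntax characters are literal
		String escaped = ClientUtils.escapeQueryChars(key);
		SolrQuery query = new SolrQuery(field + ":" + escaped);
		if (onlyOnSale) {
			//Assumed convention: status=1 means the item is still on the shelves
			query.addFilterQuery("status:1");
		}
		query.setStart(0);
		query.setRows(20);
		QueryResponse qr = solrClient.query(query);
		return qr.getBeans(Item.class);
	}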
