Powerful vector database: Milvus


In recommendation systems, nearest neighbor retrieval over vectors is a key step, especially in the recall stage. Common tools such as Annoy and faiss meet most needs. This article introduces another option: Milvus.

Milvus

Unlike vector retrieval tools such as Annoy and faiss, Milvus is an open source vector database that enables AI applications and vector similarity search.

Terms involved

  1. Field: similar to a column in a table; it can hold structured (scalar) data or vectors;
  2. Entity: a group of fields, similar to a row in a table;
  3. Collection: a group of entities, similar to a table.
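
As a rough mental model, the three terms above can be pictured with plain Python classes. This is only an analogy written for this article, not the Milvus API: the names Field, ToyCollection, and add_entity are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Field:
    """A named, typed column, e.g. an INT64 scalar or a FLOAT_VECTOR."""
    name: str
    dtype: str


@dataclass
class ToyCollection:
    """A group of entities that all share the same fields, like a table."""
    name: str
    fields: List[Field]
    entities: List[Dict[str, Any]] = field(default_factory=list)

    def add_entity(self, entity: Dict[str, Any]) -> None:
        # An entity is one row: exactly one value for every field.
        expected = {f.name for f in self.fields}
        if set(entity) != expected:
            raise ValueError(f"entity must provide exactly {sorted(expected)}")
        self.entities.append(entity)


toy = ToyCollection(
    name="example_collection",
    fields=[Field("cat_id", "INT64"), Field("example_field", "FLOAT_VECTOR")],
)
toy.add_entity({"cat_id": 2, "example_field": [0.1, 0.2, 0.3]})
print(len(toy.entities))
```

The real schema classes (FieldSchema, CollectionSchema, Collection) appear in the Python SDK section further down.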

Highlights

  1. Milvus is not just a vector retrieval tool but a vector database, so vectors belonging to different services can be stored separately and kept isolated from each other;
  2. It provides visual management tools;
  3. It supports hybrid retrieval: vector search combined with scalar filtering conditions.

Foreword tips

This article covers the latest official version, 2.0.0rc4 (https://milvus.io/cn/docs/v2.0.0/home), which I tried because it adds many powerful features. However, 2.x is still being iterated on and is not a stable release. In my experiments I ran into some problems, such as queries occasionally returning nothing, that do not occur in 1.x.

For production use you are therefore better off with the latest stable version, 1.1.1, although it lacks some of these features: https://milvus.io/cn/docs/v1.1.1/home

The official website says a stable 2.x release is coming soon; once it is out, it will make sense to upgrade to 2.x and put it into production.

| | Milvus 2.0 | Milvus 1.x |
|---|---|---|
| Architecture | Cloud native | Shared storage |
| Scalability | 500+ nodes | 1 - 32 read nodes, 1 write node |
| Persistence | Object storage (OSS), distributed file system (DFS) | Local disk, network file system (NFS) |
| Availability | 99.9% | 99% |
| Data consistency | Multiple consistency levels: strong, bounded staleness, session, consistent prefix | Eventual consistency |
| Data type support | Vector data, scalar data, strings and text (under development) | Vector data |
| Basic operations | Insert data, delete data (under development), data query, approximate nearest neighbor (ANN) search, radius-based nearest neighbor (RNN) search (under development) | Insert data, delete data, approximate nearest neighbor (ANN) search |
| Advanced features | Scalar field filtering, Time Travel, multi-cloud / multi-region deployment, data management tools | Mishards, Milvus DM data migration tool |
| Index types | Faiss, Annoy, Hnswlib, RNSG, ScaNN (under development), on-disk index (under development) | Faiss, Annoy, Hnswlib, RNSG |
| SDK | Python, Go (under development), Java (under development), RESTful (under development), C++ (under development) | Python, Java, Go, RESTful, C++ |
| Current status | Preview version. The stable version is expected to be released in August 2021. | Long-term support (LTS) version |

Docker installation

Before installing Milvus, install Docker.

If Docker is installed but docker-compose is not available, you can install it with pip. See https://docs.docker.com/compose/install/

pip install docker-compose

Standalone installation

This section uses the Docker installation method. The official documentation also describes how to install using Kubernetes.

  1. Download the docker-compose file

    wget https://raw.githubusercontent.com/milvus-io/milvus/ecfebff801291934a3e6c5955e53637b993ab41a/deployments/docker/standalone/docker-compose.yml -O docker-compose.yml
    

    If you cannot reach GitHub from your network, create a docker-compose.yml file yourself and fill in the following contents:

    version: '3.5'
    
    services:
      etcd:
        container_name: milvus-etcd
        image: quay.io/coreos/etcd:v3.5.0
        volumes:
          - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
        command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    
      minio:
        container_name: milvus-minio
        image: minio/minio:RELEASE.2020-12-03T00-03-10Z
        environment:
          MINIO_ACCESS_KEY: minioadmin
          MINIO_SECRET_KEY: minioadmin
        volumes:
          - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
        command: minio server /minio_data
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
          interval: 30s
          timeout: 20s
          retries: 3
    
      standalone:
        container_name: milvus-standalone
        image: milvusdb/milvus:v2.0.0-rc4-20210811-bdb8396
        command: ["milvus", "run", "standalone"]
        environment:
          ETCD_ENDPOINTS: etcd:2379
          MINIO_ADDRESS: minio:9000
        volumes:
          - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
        ports:
          - "19530:19530"
        depends_on:
          - "etcd"
          - "minio"
    
    networks:
      default:
        name: milvus
    
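  2. Pull the images and start the services

docker-compose up -d

This is the startup command. On the first run it pulls the three images referenced in the compose file: quay.io/coreos/etcd:v3.5.0, minio/minio:RELEASE.2020-12-03T00-03-10Z, and milvusdb/milvus:v2.0.0-rc4-20210811-bdb8396.

The service listens on port 19530 by default and runs as three Docker containers: milvus-etcd, milvus-minio, and milvus-standalone.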
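
Before moving on to the SDK, it can help to confirm that something is actually listening on port 19530. The helper below is a minimal sketch using only the Python standard library; is_port_open is a name invented here, and a successful TCP connection only shows that the port accepts connections, not that Milvus has finished starting up.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# 19530 is the port published by the standalone container above.
print("Milvus port reachable:", is_port_open("localhost", 19530))
```

If this prints False right after docker-compose up -d, waiting a little and checking the container logs is usually the first thing to try.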

Distributed installation

https://milvus.io/cn/docs/v2.0.0/install_cluster-docker.md

Python SDK

Install the dependency

pip install pymilvus-orm==2.0.0rc4

The official website says the latest version requires Python 3.6 or above, but in my tests the installation only succeeded on Python 3.8 (mainly because of the pandas version it depends on).

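Connect to the Milvus service

# numpy and the pymilvus_orm module itself are used by the snippets below
import numpy as np
import pymilvus_orm
from pymilvus_orm import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to the Milvus server
connections.connect(host='localhost', port='19530')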

Create collection

A collection must have one field that serves as the primary key and one field that stores the vectors. Fields of other types can be added as well.

field_name = "example_field"


def create_collection():
    """
    Create the example collection.
    :return: the created Collection
    """
    collection_name = "example_collection"
    from pymilvus_orm import Collection, CollectionSchema, FieldSchema, DataType
    # Primary key; auto_id=True lets Milvus generate the values
    field_id = FieldSchema(name="field_id", dtype=DataType.INT64, is_primary=True, auto_id=True)
    # Vector field used for retrieval
    field = FieldSchema(name=field_name, dtype=DataType.FLOAT_VECTOR, dim=8)
    cat_id = FieldSchema(name="cat_id", dtype=DataType.INT64)
    schema = CollectionSchema(fields=[field_id, field, cat_id], description="example collection")

    collection = Collection(name=collection_name, schema=schema)
    print(pymilvus_orm.utility.get_connection().has_collection(collection_name))
    print(pymilvus_orm.utility.get_connection().list_collections())

    return collection
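
The two requirements above (a primary key field and a vector field) can be checked before talking to the server at all. The sketch below works on plain dictionaries rather than FieldSchema objects so that it runs without Milvus; check_schema and the dictionary layout are assumptions made for this example, it only knows about FLOAT_VECTOR, and it reads "a primary key" as "exactly one".

```python
from typing import Dict, List


def check_schema(fields: List[Dict]) -> None:
    """Raise ValueError if the field list breaks the rules described above."""
    primary = [f["name"] for f in fields if f.get("is_primary", False)]
    vectors = [f for f in fields if f["dtype"] == "FLOAT_VECTOR"]
    if len(primary) != 1:
        raise ValueError(f"expected exactly one primary key field, got {primary}")
    if not vectors:
        raise ValueError("expected at least one FLOAT_VECTOR field")
    for f in vectors:
        # A vector field is only meaningful with a positive dimension.
        if f.get("dim", 0) <= 0:
            raise ValueError(f"vector field {f['name']!r} needs a positive dim")


# Mirrors the three fields created in create_collection().
example_fields = [
    {"name": "field_id", "dtype": "INT64", "is_primary": True},
    {"name": "example_field", "dtype": "FLOAT_VECTOR", "dim": 8},
    {"name": "cat_id", "dtype": "INT64"},
]
check_schema(example_fields)
print("schema looks valid")
```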

A collection can also store its data in separate partitions. Every collection has a default partition (named "_default"); data inserted without specifying a partition is stored there.

def create_partition(collection: Collection):
    """
    Create a partition in the given collection.
    :param collection:
    :return:
    """
    partition_name = "example_partition"
    partition = collection.create_partition(partition_name)

    print(collection.partitions)

    print(collection.has_partition(partition_name))

Insert data

Data can be inserted into a specific partition as needed.

  1. In the current version, data must be passed as Python lists; NumPy ndarrays are not accepted;
  2. If the primary key is auto-generated (auto_id=True), there is no need to supply primary key values;
  3. Inserted data is first held in memory and must be flushed to disk so that it is still available next time.

def insert(collection: Collection, partition_name=None):
    """
    Insert data.
    :param partition_name: the partition to insert into
    :param collection:
    :return:
    """
    # The primary key field_id is auto-generated, so it is not included here
    mr = collection.insert([
        # Each column must be a plain list, hence .tolist()
        np.random.random([10000, 8]).tolist(),  # vector
        np.random.randint(0, 10, [10000]).tolist()  # cat_id
    ], partition_name=partition_name)
    print(mr.primary_keys)

    # The inserted data is held in memory and needs to be flushed to disk
    pymilvus_orm.utility.get_connection().flush([collection.name])
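
Note that the insert call takes one list per field (column-oriented), in schema order and without the auto-generated primary key. Application data is often row-oriented instead, so a small conversion step is usually needed. The helper below is a sketch of that step in plain Python; rows_to_columns is a hypothetical name, not part of the SDK.

```python
from typing import Dict, List, Sequence


def rows_to_columns(rows: Sequence[Dict], field_order: Sequence[str]) -> List[list]:
    """Turn a list of row dicts into one plain list per field, in field_order."""
    return [[row[name] for row in rows] for name in field_order]


rows = [
    {"example_field": [0.1] * 8, "cat_id": 3},
    {"example_field": [0.2] * 8, "cat_id": 7},
]
# Same order as the non-primary fields in the schema: vector first, then cat_id.
columns = rows_to_columns(rows, ["example_field", "cat_id"])
print(columns[1])  # [3, 7]
```

The resulting columns list has the same shape as the argument passed to collection.insert in the function above.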

Create index

Creating an index on the vector field is what makes nearest neighbor search over the vectors efficient.

According to the comparison table above, the supported index types are those built on Faiss, Annoy, Hnswlib, and RNSG. The example here uses IVF_FLAT.

def create_index(collection: Collection):
    """
    Create an index on the vector field used for retrieval.
    :param collection:
    :return:
    """
    index_param = {
        "metric_type": "L2",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 1024}
    }
    collection.create_index(field_name=field_name, index_params=index_param)
    print(collection.index().params)
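
To give some intuition for the nlist parameter here and the nprobe parameter used when searching, the toy code below mimics the idea behind an inverted-file (IVF) index in pure Python: vectors are grouped into nlist buckets around centroids, and a query only scans the nprobe closest buckets. This is a simplified illustration, not how Milvus or Faiss implement IVF_FLAT; in particular the centroids are just randomly sampled vectors instead of being trained with k-means, and all function names are invented for this sketch.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]


def l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (same ranking as the true L2 distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_toy_ivf(vectors: List[Vector], nlist: int,
                  seed: int = 0) -> Tuple[List[Vector], Dict[int, List[int]]]:
    """Sample nlist centroids and assign every vector id to its nearest one."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    buckets: Dict[int, List[int]] = {i: [] for i in range(nlist)}
    for vid, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(vid)
    return centroids, buckets


def toy_ivf_search(query: Vector, vectors: List[Vector], centroids: List[Vector],
                   buckets: Dict[int, List[int]], nprobe: int,
                   limit: int) -> List[Tuple[int, float]]:
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in buckets[c]]
    ranked = sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:limit]
    return [(vid, l2(query, vectors[vid])) for vid in ranked]


rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(1000)]
centroids, buckets = build_toy_ivf(data, nlist=16)
query = [rng.random() for _ in range(8)]

# Small nprobe: fewer vectors scanned, results may miss true neighbors.
fast = toy_ivf_search(query, data, centroids, buckets, nprobe=4, limit=10)
# nprobe == nlist: every bucket is scanned, equivalent to brute force.
exact = toy_ivf_search(query, data, centroids, buckets, nprobe=16, limit=10)
print(len(fast), len(exact))
```

The trade-off is the usual one: a larger nprobe scans more buckets and improves recall at the cost of speed, and a larger nlist makes each bucket smaller.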

Query

In addition to plain vector search, Milvus also supports scalar filtering with expressions.

For example, the code below adds the condition expr="cat_id==2", so the search only considers vectors whose cat_id is 2 (cat_id is the scalar field created above).

Filtering on string fields is not supported yet; the Milvus team plans to add it in a later release.

Relational operators (such as ==, >), logical operators (and: &&, or: ||) and the IN operator are supported.

def search(collection: Collection, partition_name=None):
    """
    Vector retrieval
    :param collection:
    :param partition_name: Retrieves the vector of the specified partition
    :return:
    """
    # Load collection into memory
    collection.load()
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    # Vector search
    result = collection.search(data=np.random.random([5, 8]).tolist(),
                               anns_field=field_name, param=search_params, limit=10,
                               partition_names=[partition_name] if partition_name else None)
    print(result[0].ids)
    print(result[0].distances)

    # Expression filter: search only among vectors whose cat_id is 2
    result = collection.search(data=np.random.random([5, 8]).tolist(),
                               anns_field=field_name, param=search_params, limit=10,
                               expr="cat_id==2")
    print(result[0].ids)
    print(result[0].distances)
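
Filter expressions are ordinary strings, so it is easy to build them from Python values. The helpers below are a sketch with invented names (eq_expr, in_expr, and_expr); the lowercase `in [..]` list syntax and the spacing are assumptions based on the operators listed above, so check the expression rules of your Milvus version before relying on them. Only integers are accepted because string filtering is not supported yet.

```python
from typing import Iterable


def _require_int(value) -> None:
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"only integer values are supported, got {value!r}")


def eq_expr(field: str, value: int) -> str:
    """Build an equality filter such as 'cat_id==2'."""
    _require_int(value)
    return f"{field}=={value}"


def in_expr(field: str, values: Iterable[int]) -> str:
    """Build a membership filter such as 'cat_id in [1, 2, 3]'."""
    items = list(values)
    for v in items:
        _require_int(v)
    return f"{field} in [{', '.join(str(v) for v in items)}]"


def and_expr(*parts: str) -> str:
    """Join several conditions with the logical AND operator."""
    if not parts:
        raise ValueError("at least one condition is required")
    return " && ".join(parts)


print(eq_expr("cat_id", 2))                    # cat_id==2
print(in_expr("cat_id", [1, 2, 3]))            # cat_id in [1, 2, 3]
print(and_expr("cat_id > 1", "cat_id < 5"))    # cat_id > 1 && cat_id < 5
```

The resulting string would be passed as the expr argument of collection.search, in the same way as "cat_id==2" above.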

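Delete data

At present, the following three deletion operations are supported. The index and partitions are dropped before the collection itself, because nothing can be dropped from a collection that no longer exists.

def drop(collection: Collection):
    # Delete the index
    collection.drop_index()
    # Delete a partition (the one created in create_partition above)
    collection.drop_partition("example_partition")
    # Delete the collection; do this last
    collection.drop()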

Release

def release(collection: Collection = None):
    # Free collection from memory
    if collection:
        collection.release()

    # Disconnect from the server and free up resources
    connections.disconnect("default")

Other SDKs

Version 1.x supports Python, Java, Go, RESTful, and C++ (see the comparison table above).

At present, version 2.x only has a Python SDK; the others are still under development.

Visual management

Another strength of Milvus is that it comes with a visual management tool.

It is also installed and started with Docker:

docker run -p 8000:3000 -e HOST_URL=http://{your machine IP}:8000 -e MILVUS_URL={your machine IP}:19530 milvusdb/milvus-insight:latest

Note: use the machine's actual IP here rather than localhost, otherwise connection problems may occur.

  1. View the collections loaded into memory

  2. View the structure and partitions of a collection, with support for delete and import operations

  3. Online vector retrieval
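
Since the command needs the machine's real IP rather than localhost, a small helper can look it up and fill in the placeholders. This is a best-effort sketch with hypothetical function names: the UDP trick only asks the operating system which interface it would route through (10.255.255.255 is just a non-local placeholder address and nothing is sent), and on a machine without a usable network it falls back to 127.0.0.1, which the note above says to avoid.

```python
import socket


def guess_local_ip() -> str:
    """Best-effort guess of this machine's LAN IP; falls back to 127.0.0.1."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a route.
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


def insight_command(ip: str) -> str:
    """Fill the machine IP into the docker run command shown above."""
    return (
        "docker run -p 8000:3000 "
        f"-e HOST_URL=http://{ip}:8000 "
        f"-e MILVUS_URL={ip}:19530 "
        "milvusdb/milvus-insight:latest"
    )


print(insight_command(guess_local_ip()))
```

On machines with several network interfaces the guessed address may not be the one you want, so check it before running the printed command.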

Subsequent functions

Milvus is already quite capable as it stands, and the team continues to add powerful new features.


Tags: Python Database Deep Learning

Posted on Thu, 02 Sep 2021 02:01:52 -0400 by azylka