Analysis of a question answering system based on a medical knowledge graph

Knowledge graphs are increasingly applied in question answering, semantic search, and related fields, and are a hot research topic in artificial intelligence. The author recently studied QASystemOnMedicalKG, a question answering project based on a medical knowledge graph (github link: https://github.com/liuhuanyong/QASystemOnMedicalKG ), went through it fairly systematically, benefited a great deal, and also gave some preliminary thought to how the project could be further improved. This blog is written both to share those notes and to keep them from being forgotten.
1. Main structure of the project
The project consists of two parts: construction of the knowledge graph, and question answering based on the knowledge graph.
2. Construction of the knowledge graph
(1) Crawl disease-centered medical data. The data volume is large: on an ordinary personal computer the crawling program takes about a day to run. The crawled data is imported into MongoDB, a non-relational database. See the script data_spider.py.
(2) Clean the crawled data into the "node: attribute" dictionary form that can be imported directly into the graph database neo4j, and import the cleaned data into MongoDB; see the script build_data.py. (Because the first_name.txt file required for data cleaning is missing, the author analyzed the code of build_data.py, adapted the crawler part of the original project accordingly, and wrote the first_name_spider.py script below to crawl first_name.txt. Readers who need the file can download it from the following link: https://pan.baidu.com/s/1LNMyffgl4Qic1EryFDimkg , extraction code: 37ul)

#!/usr/bin/python3
# -*- coding:utf-8 -*-

"""
@Author  : heyw
@Contact : he_yuanwen@126.com
@Time    : 2020/1/31 21:25
@Software: PyCharm
@FileName: first_name_spider.py
"""
import urllib.request
import urllib.parse
from lxml import etree



'''Spider that collects doctors' names'''
class CrimeSpider:
    def __init__(self):
        pass

    '''Request and decode the HTML for a given URL'''
    def get_html(self, url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/51.0.2704.63 Safari/537.36'}
        req = urllib.request.Request(url=url, headers=headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('gbk')  # the crawled pages are GBK-encoded
        return html

    '''Parse article links from a listing page (not called by spider_main)'''
    def url_parser(self, content):
        selector = etree.HTML(content)
        urls = ['http://www.anliguan.com' + i for i in selector.xpath('//h2[@class="item-title"]/a/@href')]
        return urls

    '''Crawl all symptom pages and save the collected names'''
    def spider_main(self):
        namesList = []
        for page in range(1, 11000):
            try:
                symptom_url = 'http://jib.xywy.com/il_sii/symptom/%s.htm'%page
                names = self.name_spider(symptom_url)
                namesList.extend(names)
                print(page, symptom_url)
            except Exception as e:
                print(e, page)

        self.save_names(namesList)
        return

    '''Write the collected names to first_name.txt'''
    def save_names(self, namesList):
        with open('first_name.txt','w',encoding='utf-8') as f:
            for name in namesList:
                f.write(name + "\n")

    '''Extract doctors' names from a symptom page'''
    def name_spider(self, url):
        html = self.get_html(url)
        selector = etree.HTML(html)
        names = selector.xpath('//span[@class="fr replay-docter"]//a[@class="gre"]/text()')
        return names


if __name__ == '__main__':
    handler = CrimeSpider()
    handler.spider_main()

(3) From the dictionary-form data, create nodes and define disease-centered relationships, forming knowledge represented as triples. The nodes and relationships are imported into the neo4j database to form the knowledge graph; see the script build_medicalgraph.py.
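As a minimal sketch of this step (the labels `Disease`/`Symptom`, the relationship name, and the MERGE pattern are illustrative assumptions, not the exact schema of build_medicalgraph.py), a triple can be rendered as a Cypher statement before being sent to neo4j:

```python
def triple_to_cypher(head: str, rel: str, tail: str) -> str:
    """Render a (disease, relation, entity) triple as a Cypher statement.

    The node labels and MERGE pattern here are illustrative only; the real
    build_medicalgraph.py defines its own node types and relationships.
    """
    return (
        f"MERGE (d:Disease {{name: '{head}'}}) "
        f"MERGE (s:Symptom {{name: '{tail}'}}) "
        f"MERGE (d)-[:{rel}]->(s)"
    )

# One statement per triple; a real importer would batch these.
stmt = triple_to_cypher("pneumonia", "HAS_SYMPTOM", "cough")
```

Using MERGE rather than CREATE avoids duplicate nodes when the same disease or symptom appears in many triples.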
3. Composition of the question answering system
The question answering system is the core part of the project and consists of three components: question classification, question parsing, and answer search.
(1) Question classification. The original project classifies questions in two steps. The first step identifies keywords in the question with the Aho-Corasick (AC) automaton algorithm and determines each keyword's type from a large "entity: type" dictionary built in advance. The second step is rule-based: several keyword dictionaries for question types are built from experience, and the question type is determined by whether those keywords appear in the question. Determining the type is in fact a multi-class classification problem. Question classification is the core of the core; the author believes the approach here can be further optimized, as discussed in part 4.
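The two-step idea can be sketched in pure Python as follows. The original project uses an AC automaton for step 1; plain substring matching stands in for it here, and both dictionaries are toy examples, not the project's real data:

```python
# Step 1 dictionary: "entity: type" (the real one is large and built in advance).
ENTITY_TYPES = {"pneumonia": "disease", "cough": "symptom"}
# Step 2 dictionaries: trigger keywords per question type (built from experience).
QUESTION_WORDS = {
    "symptom": ["symptom", "manifestation"],
    "cause": ["cause", "why"],
}

def classify(question: str):
    # Step 1: find known entities in the question, look up their types.
    # (The original project does this scan with an Aho-Corasick automaton.)
    entities = {w: t for w, t in ENTITY_TYPES.items() if w in question}
    # Step 2: rule-based question typing from trigger keywords.
    qtypes = [qt for qt, kws in QUESTION_WORDS.items()
              if any(kw in question for kw in kws)]
    return entities, qtypes
```

For example, `classify("what symptom does pneumonia have")` recognizes "pneumonia" as a disease and types the question as a symptom query.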
(2) Question parsing. The essence of question parsing is to select the appropriate neo4j MATCH (Cypher) statement according to the question type.
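In other words, parsing reduces to template selection. A sketch, with illustrative templates that are not the project's exact Cypher:

```python
# Map each question type to a Cypher template; the entity recognized in
# question classification fills the slot. Templates here are illustrative.
CYPHER_TEMPLATES = {
    "symptom": "MATCH (d:Disease {{name: '{e}'}})-[:HAS_SYMPTOM]->(s) RETURN s.name",
    "cause":   "MATCH (d:Disease {{name: '{e}'}})-[:HAS_CAUSE]->(c) RETURN c.name",
}

def parse_question(qtype: str, entity: str) -> str:
    # Select the template for this question type and bind the entity.
    return CYPHER_TEMPLATES[qtype].format(e=entity)
```

A production version would parameterize the query instead of string formatting, to avoid injection and quoting issues.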
(3) Answer search. Connect to the neo4j database, execute the Cypher statement produced by question parsing, and transform the result into a form acceptable to users.
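A sketch of this last step. The database call is shown commented out because it needs a live neo4j instance (the connection URL and credentials are placeholders); the answer-formatting function itself is plain Python:

```python
# Assumed execution against neo4j via the py2neo client (placeholder
# credentials; not the project's exact wiring):
# from py2neo import Graph
# graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
# rows = graph.run(cypher).data()   # e.g. [{"s.name": "cough"}, ...]

def format_answer(disease: str, rows: list) -> str:
    """Turn raw result rows into a user-readable sentence."""
    values = [list(r.values())[0] for r in rows]  # one returned column per row
    return f"Common symptoms of {disease}: {', '.join(values)}"
```

The English sentence template is illustrative; the original project composes its answers in Chinese.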
4. Suggestions for improving the project
Since question classification is the core of the core, the improvements discussed here focus on that component, and on keyword recognition in particular.
The original project performs keyword recognition with the multi-pattern matching AC automaton algorithm, which makes recognition efficient. However, the AC automaton is an exact-matching algorithm, so matching easily fails on variants or misspellings that are not in the dictionary. Keyword recognition could instead be done with named entity recognition. Entity recognition, however, introduces a new problem: the large "entity: type" dictionary no longer applies directly to newly recognized entities. To solve this, a text classification model can be built to assign types to the recognized entities.
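Short of a trained entity-recognition model, one low-cost way to soften the exact-match limitation is an approximate-match fallback against the same large dictionary when the automaton finds nothing. The stdlib `difflib` stands in for a proper similarity model here, and the dictionary is a toy example:

```python
from difflib import get_close_matches

# Toy stand-in for the project's large entity dictionary.
ENTITY_DICT = ["pneumonia", "bronchitis", "hypertension"]

def fuzzy_lookup(word: str, cutoff: float = 0.7):
    """Return the closest dictionary entry above the similarity cutoff, else None.

    Intended as a fallback after exact (AC automaton) matching fails,
    e.g. for a misspelled disease name in the user's question.
    """
    matches = get_close_matches(word, ENTITY_DICT, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

This recovers, for instance, a dropped-letter misspelling of "pneumonia", while still mapping the recovered entity back to the existing "entity: type" dictionary, so the downstream classification rules stay unchanged.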


Posted on Fri, 31 Jan 2020 23:55:59 -0500 by ClevaTreva