Python distributed general-purpose crawler

Division into .py files

The distributed crawler is divided into five main parts: (1) the execution file (where the program starts), (2) obtaining XPath information, (3) obtaining the required content, (4) time processing, and (5) data storage (writing to the database).

We may want to collect information from websites, whether by downloading pages, copying text by hand, or extracting content with a program. When we need only part of a page rather than all of it, a program can extract just that part. Languages suited to this include C, C++, Java and Python. Because Python is easy to use, it is the language used here. The tools used include PyCharm and Oracle 11g.

Processing and unifying time formats

Different countries write dates and times in different formats, so we need to unify them into one format to make inserting, deleting, updating and querying the data easier.

Normal time conversion

In most cases, the parse function from the dateutil library (a third-party package, not part of the Python standard library) can convert the time string directly into a datetime object.

Code display

Here are some common time conversion snippets.
The first handles the internationally common year-first or month-name forms of October 9, 2021, such as ① 2021/10/9 ② October 9 2021 ③ 9 October 2021. The second handles forms of the same date that put the day first, such as ① 9/10/2021 ② 9-10-2021 ③ 9.10.2021.

from dateutil.parser import parse  # third-party package: python-dateutil

parse('2021/10/9', fuzzy=True).replace(tzinfo=None)  # year first
parse('9/10/2021', fuzzy=True, dayfirst=True).replace(tzinfo=None)  # day first
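
For purely numeric dates, the same year-first versus day-first distinction can also be handled with the standard library alone. The following is a minimal sketch, not the crawler's own code: the names parse_numeric_date, YEAR_FIRST and DAY_FIRST are hypothetical, and the list of layouts is only an assumption about which separators appear on the target sites.

```python
from datetime import datetime

# Candidate layouts, tried in order; extend the tuples for other sites.
YEAR_FIRST = ("%Y/%m/%d", "%Y-%m-%d", "%Y.%m.%d")
DAY_FIRST = ("%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y")


def parse_numeric_date(text, dayfirst=False):
    """Return a naive datetime for a numeric date, or None if no layout fits."""
    for fmt in (DAY_FIRST if dayfirst else YEAR_FIRST):
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next layout
    return None


print(parse_numeric_date("2021/10/9"))
print(parse_numeric_date("9.10.2021", dayfirst=True))
```

Returning None instead of raising lets the caller fall back to another conversion when the string is not a plain numeric date.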

Buddhist calendar time conversion

In this case the year is written in the Buddhist calendar, which runs 543 years ahead of the Gregorian calendar, so the general-purpose conversion above cannot be used directly and the time needs special processing.

Code display

Here are some snippets for Buddhist calendar time conversion.
The month substitution is done in another function, which is shown at the end.
Because these dates are written in many different ways, the code handles several cases separately. Together they cover most of the Buddhist calendar formats encountered.

import re
from datetime import datetime


def foli_time(original_time, month_time=''):
    # month_time is the two-digit month from month_time_set(), or '' when the
    # month is written as a number (signature assumed from the variables used).
    each_found = re.findall(r'\d+', str(original_time))
    if month_time == '':
        # Numeric month: the digits are day, month, year[, hour, minute].
        day, month_time, year_raw = each_found[0], each_found[1], each_found[2]
        clock = each_found[3:5]
    else:
        # Month written as a word: the digits are day, year[, hour, minute].
        day, year_raw = each_found[0], each_found[1]
        clock = each_found[2:4]
    if len(year_raw) == 4:
        # Buddhist Era year = Gregorian year + 543.
        year_time = str(int(year_raw) - 543)
    elif len(year_raw) == 2:
        # Two-digit Buddhist year, e.g. 64 (2564 BE) -> 2021; assumes 2543 BE or later.
        year_time = '20' + str(int(year_raw) - 43).zfill(2)
    else:
        raise ValueError('unrecognized year: ' + year_raw)
    if len(clock) == 2:
        com_time = year_time + '-' + month_time + '-' + day + ' ' + clock[0] + ':' + clock[1]
        return datetime.strptime(com_time, "%Y-%m-%d %H:%M")
    com_time = year_time + '-' + month_time + '-' + day
    return datetime.strptime(com_time, "%Y-%m-%d")
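
The year arithmetic is the core of this conversion: a Buddhist Era year is 543 larger than the Gregorian year. A minimal sketch of just that step follows. The helper name buddhist_to_gregorian and the constant BE_OFFSET are hypothetical, and the day, month and year are assumed to have already been extracted as integers.

```python
from datetime import datetime

BE_OFFSET = 543  # Buddhist Era year minus Gregorian year


def buddhist_to_gregorian(day, month, be_year, hour=0, minute=0):
    """Build a datetime from a Buddhist Era date (hypothetical helper)."""
    return datetime(be_year - BE_OFFSET, month, day, hour, minute)


print(buddhist_to_gregorian(9, 10, 2564))  # 2021-10-09 00:00:00
```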

Myanmar time conversion

Burmese dates may even write the digits in Burmese numerals, so Burmese time conversion is listed separately: the numerals are first replaced with ASCII digits and then processed.

Code display

Here are some snippets for Myanmar time conversion.
The month substitution is done in another function, which is shown at the end.

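import re
from datetime import datetime


def miandian_time(original_time, month_time):
    # month_time is the two-digit month string returned by month_time_set()
    # (signature assumed from the variables used in the body).
    # Replace Burmese numerals with ASCII digits before extracting numbers.
    original_time = str(original_time).replace('၁', '1').replace('၂', '2').replace('၃', '3').replace('၄', '4').replace('၅', '5').replace('၆', '6').replace('၇', '7').replace('၈', '8').replace('၉', '9').replace('၀', '0')
    each_found = re.findall(r'\d+', original_time)
    if len(each_found) >= 4:
        # The digits are day, year, hour, minute.
        com_time = each_found[1] + '-' + month_time + '-' + each_found[0] + ' ' + each_found[2] + ':' + each_found[3]
        return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
    else:
        # The digits are day, year.
        com_time = each_found[1] + '-' + month_time + '-' + each_found[0]
        return_time = datetime.strptime(com_time, "%Y-%m-%d")
    return return_time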
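
The numeral replacement can also be written once for every script instead of one replace call per digit. The sketch below is one possible alternative, not the crawler's own code. It relies on the standard unicodedata module, and the name normalize_digits is hypothetical.

```python
import unicodedata


def normalize_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    return ''.join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


print(normalize_digits('၂၀၂၁'))  # Burmese numerals
print(normalize_digits('༢༠༢༡'))  # Tibetan numerals
```

Because the lookup uses the Unicode character database, the same helper would also cover Tibetan, Thai, Khmer and Lao numerals without listing them by hand.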

Time conversion for other languages whose months must be replaced

Some other countries write the month in their own language, such as German, Italian, Khmer or Lao.

Code display

Here are some snippets for other time conversions that need the month replaced.
The month substitution is done in another function, which is shown at the end.

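import re
from datetime import datetime


def other_time(original_time, month_time):
    # month_time is the two-digit month string returned by month_time_set()
    # (signature assumed from the variables used in the body).
    each_found = re.findall(r'\d+', str(original_time))
    if len(each_found) >= 4:
        # The digits are day, year, hour, minute.
        com_time = each_found[1] + '-' + month_time + '-' + each_found[0] + ' ' + each_found[2] + ':' + each_found[3]
        return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
    else:
        # The digits are day, year.
        com_time = each_found[1] + '-' + month_time + '-' + each_found[0]
        return_time = datetime.strptime(com_time, "%Y-%m-%d")
    return return_time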

Interval conversion

Besides the cases above, a time may be expressed relative to now, such as a few hours ago, a few days ago, yesterday or today, so these need a different solution.

Code display

Some interval conversion code slices are shown below.

from datetime import datetime, timedelta

from dateutil.parser import parse


def ago_time(at):
    # Every branch returns midnight of the computed day; the time of day is discarded.
    if 'Minutes ago' in at:
        minutes = at[:at.find('Minutes ago')]
        return datetime.combine(datetime.now() - timedelta(minutes=int(minutes)), datetime.min.time())
    elif 'min' in at:
        minutes = at[:at.find('min')]
        return datetime.combine(datetime.now() - timedelta(minutes=int(minutes)), datetime.min.time())
    elif 'Hours ago' in at:
        hour = at[:at.find('Hours ago')]
        return datetime.combine(datetime.now() - timedelta(hours=int(hour)), datetime.min.time())
    elif 'hour' in at:
        hour = at[:at.find('hour')]
        return datetime.combine(datetime.now() - timedelta(hours=int(hour)), datetime.min.time())
    elif 'hr' in at:
        hour = at[:at.find('hr')]
        return datetime.combine(datetime.now() - timedelta(hours=int(hour)), datetime.min.time())
    elif 'today' in at or 'Today' in at or 'TODAY' in at:
        return datetime.combine(datetime.now().date(), datetime.min.time())
    elif 'yesterday' in at or 'Yesterday' in at or 'YESTERDAY' in at:
        return datetime.combine(datetime.now().date() - timedelta(days=1), datetime.min.time())
    elif 'Days ago' in at:
        day = at[:at.find('Days ago')]
        return datetime.combine(datetime.now() - timedelta(days=int(day)), datetime.min.time())
    elif 'day' in at:
        day = at[:at.find('day')]
        return datetime.combine(datetime.now() - timedelta(days=int(day)), datetime.min.time())
    elif 'Weeks ago' in at:
        week = at[:at.find('Weeks ago')]
        return datetime.combine(datetime.now() - timedelta(weeks=int(week)), datetime.min.time())
    elif 'week' in at:
        week = at[:at.find('week')]
        return datetime.combine(datetime.now() - timedelta(weeks=int(week)), datetime.min.time())
    elif 'Months ago' in at:
        month = at[:at.find('Months ago')]
        return datetime.combine(datetime.now() - timedelta(days=int(month) * 30), datetime.min.time())
    elif 'month' in at:
        month = at[:at.find('month')]
        return datetime.combine(datetime.now() - timedelta(days=int(month) * 30), datetime.min.time())
    else:
        try:
            return parse(at, fuzzy=True).replace(tzinfo=None)
        except Exception as e_time:
            print(e_time)
            return datetime.combine(datetime.now().date(), datetime.min.time())
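
The chain of substring checks can be condensed with one regular expression that captures the number and the unit together. The following is a sketch of that alternative, not the crawler's own code. The names relative_time, UNITS and RELATIVE are hypothetical, the now parameter is added only so the function can be tested, and a month is approximated as 30 days as in ago_time.

```python
import re
from datetime import datetime, timedelta

# Unit keyword -> timedelta arguments for one unit (a month is taken as 30 days).
UNITS = {
    'min': {'minutes': 1},
    'hour': {'hours': 1},
    'hr': {'hours': 1},
    'day': {'days': 1},
    'week': {'weeks': 1},
    'month': {'days': 30},
}
RELATIVE = re.compile(r'(\d+)\s*(min|hour|hr|day|week|month)', re.IGNORECASE)


def relative_time(text, now=None):
    """Return midnight of the day described by text, or None if it is not relative."""
    now = now or datetime.now()
    lowered = text.lower()
    if 'yesterday' in lowered:
        delta = timedelta(days=1)
    elif 'today' in lowered:
        delta = timedelta(0)
    else:
        match = RELATIVE.search(text)
        if match is None:
            return None  # let the caller try an absolute-date parser
        amount, unit = int(match.group(1)), match.group(2).lower()
        delta = timedelta(**UNITS[unit]) * amount
    return datetime.combine((now - delta).date(), datetime.min.time())


print(relative_time('3 Hours ago', now=datetime(2021, 10, 9, 14, 22)))
```

Requiring a number directly before the unit means that a weekday such as Monday, which contains the letters day, is not mistaken for a relative time.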

Chinese and Tibetan time conversion

Chinese times can be processed directly. Tibetan times need some replacements first.

Code display

Here are some snippets for Chinese and Tibetan time conversion.
I have encountered very few Tibetan dates, so there may well be Tibetan time formats that this code cannot handle.

if re.findall(r'སྤྱི་ཟླ་*. པའི་ཚེས་*.', str(original_time)):
    original_time = str(original_time).replace('༡', '1').replace('༢', '2').replace('༣', '3').replace('༤', '4').replace('༥', '5').replace('༦', '6').replace('༧', '7').replace('༨', '8').replace('༩', '9').replace('༠', '0')
    original_time = original_time.replace(r'སྤྱི་ཟླ་', '').replace(r' པའི་ཚེས་', '-').replace(r'  ', ' ').replace(r' ', '-')
if re.findall(r'\d{1,4}-\d{1,2}-\d{1,2}    ཁུངས།', str(original_time)):
    original_time = ''.join(re.findall(r'\d{1,4}-\d{1,2}-\d{1,2}    ཁུངས།', str(original_time)))
if re.findall(r'སྤེལ་དུས།: \d{1,4}-\d{1,2}-\d{1,2}  རྩོམ་པ་པོ།', str(original_time)):
    original_time = ''.join(re.findall(r'སྤེལ་དུས།: \d{1,4}-\d{1,2}-\d{1,2}  རྩོམ་པ་པོ།', str(original_time)))
if re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\s\d{1,2}:\d{1,2}', str(original_time)):
    com_time = ''.join(re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\s\d{1,2}:\d{1,2}', str(original_time))).replace('年', '-').replace('月', '-').replace('日', '')
    return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
    return return_time
elif re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\d{1,2}:\d{1,2}', str(original_time)):
    com_time = ''.join(re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\d{1,2}:\d{1,2}', str(original_time))).replace('年', '-').replace('月', '-').replace('日', ' ')
    return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
    return return_time
elif re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日', str(original_time)):
    com_time = ''.join(re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日', str(original_time))).replace('年', '-').replace('月', '-').replace('日', '')
    return_time = datetime.strptime(com_time, "%Y-%m-%d")
    return return_time
elif re.findall(r'\d{1,4} ཟླ་བ་ \d{1,2} ཚེས \d{1,2}', str(original_time)):
    com_time = ''.join(re.findall(r'\d{1,4} ཟླ་བ་ \d{1,2} ཚེས \d{1,2}', str(original_time))).replace(' ཟླ་བ་ ', '-').replace(' ཚེས ', '-')
    return_time = datetime.strptime(com_time, "%Y-%m-%d")
    return return_time
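
For the Chinese case, the separate branches for a date with a time, a date with a time but no space, and a date alone can be folded into one pattern with an optional time group. This is a minimal sketch under the assumption that the dates use the usual 年, 月 and 日 markers and a four-digit year. The names CN_DATE and parse_cn_date are hypothetical.

```python
import re
from datetime import datetime

# Year, month, day, then an optional HH:MM with or without a separating space.
CN_DATE = re.compile(r'(\d{4})年(\d{1,2})月(\d{1,2})日(?:\s*(\d{1,2}):(\d{1,2}))?')


def parse_cn_date(text):
    """Return a datetime for the first Chinese-style date in text, or None."""
    match = CN_DATE.search(text)
    if match is None:
        return None
    # A missing time group defaults to '0', giving midnight.
    year, month, day, hour, minute = match.groups(default='0')
    return datetime(int(year), int(month), int(day), int(hour), int(minute))


print(parse_cn_date('2021年10月9日 14:22'))
print(parse_cn_date('2021年10月9日'))
```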

Month replacement function

Some other countries write the month in their own language, such as German, Italian, Khmer or Lao. The main purpose of this function is to convert month names in these languages into two-digit numeric strings.

Code display

Here are some code snippets of month replacement functions.

def month_time_set(ot):
    month_time = ''
    if ("JAN" in ot) or ("Jan" in ot) or ("ม.ค." in ot) or ("มกราคม" in ot) or ("មករា" in ot) or ("ဇန္နဝါရီ၊" in ot) or ("ဇန်နဝါရီ" in ot) or ("ມັງກອນ" in ot) or ("Gennaio" in ot):
        month_time = '01'
    elif ("FEB" in ot) or ("Feb" in ot) or ("ก.พ." in ot) or ("กุมภาพันธ์" in ot) or ("កុម្ភៈ" in ot) or ("ေဖေဖာ္၀ါရီ၊" in ot) or ("ဖေဖော်ဝါရီ" in ot) or ("ກຸມພາ" in ot):
        month_time = '02'
    elif ("MAR" in ot) or ("Mar" in ot) or ("มี.ค." in ot) or ("มีนาคม" in ot) or ("មីនា" in ot) or ("Mär" in ot) or ("Maerz" in ot) or ("မတ္၊" in ot) or ("မတ်" in ot) or ("ມີນາ" in ot):
        month_time = '03'
    elif ("APR" in ot) or ("Apr" in ot) or ("เม.ย." in ot) or ("เมษายน" in ot) or ("មេសា" in ot) or ("ဧၿပီ၊" in ot) or ("ဧပြီ" in ot) or ("ເມສາ" in ot):
        month_time = '04'
    elif ("MAY" in ot) or ("May" in ot) or ("พ.ค." in ot) or ("พฤษภาคม" in ot) or ("ឧសភា" in ot) or ("Mai" in ot) or ("ေမ၊" in ot) or ("မေ" in ot) or ("ພຶດສະພາ" in ot) or ("Maggio" in ot):
        month_time = '05'
    elif ("JUN" in ot) or ("Jun" in ot) or ("มิ.ย." in ot) or ("มิถุนายน" in ot) or ("មិថុនា" in ot) or ("ဇြန္၊" in ot) or ("ဇွန်" in ot) or ("ມິຖຸນາ" in ot) or ("Giugno" in ot):
        month_time = '06'
    elif ("JUL" in ot) or ("Jul" in ot) or ("ก.ค." in ot) or ("กรกฎาคม" in ot) or ("កក្កដា" in ot) or ("ဇူလိုင္၊" in ot) or ("ဇူလိုင်" in ot) or ("ກໍລະກົດ" in ot) or ("Luglio" in ot):
        month_time = '07'
    elif ("AUG" in ot) or ("Aug" in ot) or ("ส.ค." in ot) or ("สิงหาคม" in ot) or ("សីហា" in ot) or ("ၾသဂုတ္၊" in ot) or ("ဩဂုတ်" in ot) or ("ြဂုတ်" in ot) or ("ສິງຫາ" in ot) or ("Agosto" in ot):
        month_time = '08'
    elif ("SEP" in ot) or ("Sep" in ot) or ("ก.ย." in ot) or ("กันยายน" in ot) or ("កញ្" in ot) or ("စက္တင္ဘာ၊" in ot) or ("စက်တင်ဘာ" in ot) or ("ກັນຍາ" in ot) or ("Settembre" in ot):
        month_time = '09'
    elif ("OCT" in ot) or ("Oct" in ot) or ("ต.ค." in ot) or ("ตุลาคม" in ot) or ("តុលា" in ot) or ("Okt" in ot) or ("ေအာက္တိုဘာ၊" in ot) or ("အောက်တိုဘာ" in ot) or ("ຕຸລາ" in ot) or ("Ottobre" in ot):
        month_time = '10'
    elif ("NOV" in ot) or ("Nov" in ot) or ("พ.ย." in ot) or ("พฤศจิกายน" in ot) or ("វិច្ឆិកា" in ot) or ("ႏိုဝင္ဘာ၊" in ot) or ("နိုဝင်ဘာ" in ot) or ("ພະຈິກ" in ot):
        month_time = '11'
    elif ("DEC" in ot) or ("Dec" in ot) or ("ธ.ค." in ot) or ("ธันวาคม" in ot) or ("ធ្នូ" in ot) or ("Dez" in ot) or ("ဒီဇင္ဘာ၊" in ot) or ("ဒီဇင်ဘာ" in ot) or ("ທັນວາ" in ot) or ("Dicembre" in ot):
        month_time = '12'
    return month_time
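
The same lookup can be kept in a table so that adding a language means adding data rather than another condition. The sketch below is only an illustration with a handful of entries taken from month_time_set above. The names MONTH_NAMES and find_month are hypothetical, and lowercasing the input is an added assumption that also lets spellings such as GENNAIO or oktober match.

```python
# A few illustrative entries only; the full list of spellings is in month_time_set.
MONTH_NAMES = {
    '01': ('jan', 'gennaio'),
    '03': ('mar', 'mär', 'maerz'),
    '05': ('may', 'mai', 'maggio'),
    '10': ('oct', 'okt', 'ottobre'),
    '12': ('dec', 'dez', 'dicembre'),
}


def find_month(text):
    """Return the two-digit month found in text, or '' if none is recognized."""
    lowered = text.lower()
    for number, names in MONTH_NAMES.items():
        if any(name in lowered for name in names):
            return number
    return ''


print(find_month('9. Oktober 2021'))
print(find_month('9 MAGGIO 2021'))
```

Like month_time_set, this returns an empty string when no month name is present, which the conversion functions can use to detect a numeric month.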

Tags: Python crawler

Posted on Sat, 09 Oct 2021 14:22:02 -0400 by reflash