Sections by .py file
The distributed crawler is divided into five main parts: (1) the execution file (the entry point of the program), (2) obtaining XPath information, (3) obtaining the required content, (4) time processing, and (5) data warehousing.
We can collect information from websites in several ways: downloading the page content, copying the text by hand, or using a program to extract it. If we only want part of the content rather than all of it, a program can extract just that part. The programming languages available for this include C, C++, Java, Python and so on. Because Python is easy to use, it is the language we use here. The tools we use include PyCharm and Oracle 11g.
Process and unify time format
Because different countries write dates and times differently, we need to unify the format so that the data is easy to insert, delete, update and query.
Normal time conversion
In most cases, we can use an existing Python parsing function directly and convert the time to a datetime object. The `parse` calls below take `fuzzy` and `dayfirst` arguments, as `dateutil.parser.parse` does.
Code display
Here are some common time conversion code snippets.
The first kind is the internationally common form, for example (October 9, 2021): ① 2021/10/9, ② July 9 2021, ③ 9 July 2021, and so on. The second kind writes the day first, for example (October 9, 2021): ① 9/10/2021, ② 9-10-2021, ③ 9.10.2021, and so on.
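    # parse is assumed to be dateutil.parser.parse (it accepts fuzzy= and dayfirst=).
    from dateutil.parser import parse
    parse('2021/10/9', fuzzy=True).replace(tzinfo=None)                 # year first
    parse('9/10/2021', fuzzy=True, dayfirst=True).replace(tzinfo=None)  # day first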
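The day-first forms listed above differ only in their separator, so they can also be normalized without a parsing library. The following is a minimal sketch that uses only the standard library; the function name `parse_day_first` is hypothetical and not part of the crawler.

```python
import re
from datetime import datetime


def parse_day_first(text):
    """Parse 'day<sep>month<sep>year' where <sep> is '/', '-' or '.'."""
    match = re.fullmatch(r'\s*(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\s*', text)
    if match is None:
        raise ValueError('not a day-first date: ' + repr(text))
    day, month, year = (int(part) for part in match.groups())
    # datetime() validates the ranges, so 31/02/2021 raises ValueError.
    return datetime(year, month, day)


for sample in ('9/10/2021', '9-10-2021', '9.10.2021'):
    print(sample, '->', parse_day_first(sample))
```

All three samples give October 9, 2021. Unlike a fuzzy parser, this sketch rejects anything that is not exactly a day-first date, which may or may not be what a crawler wants.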
Buddhist calendar time conversion
In this case, a ready-made parsing function is not enough on its own, because the year is counted in the Buddhist calendar. The time needs special processing.
Code display
Here are some code snippets for Buddhist calendar time conversion.
The month substitution is done in a separate function, which is shown at the end.
Because Buddhist calendar dates are written in many different ways, the code is split into several cases. Together they cover essentially all the forms of the Buddhist calendar we have seen.
    import re
    from datetime import datetime


    def foli_time(original_time, month_time):
        """Convert a Buddhist calendar date string to a datetime.

        month_time is the two-digit month from the month replacement function,
        or '' when the month is written as a number.
        """
        each_found = re.findall(r'\d+', str(original_time))
        if month_time == '':
            # Numeric month: the digits are day, month, year[, hour, minute].
            year_raw, month_time = each_found[2], str(each_found[1])
            clock = each_found[3:5]
        else:
            # Month name already replaced: the digits are day, year[, hour, minute].
            year_raw = each_found[1]
            clock = each_found[2:4]
        year_time = ''
        if len(year_raw) == 4:
            # Four-digit Buddhist year, e.g. 2564 - 543 = 2021.
            year_time = str(int(year_raw) - 543)
        elif len(year_raw) == 2:
            # Two-digit Buddhist year, e.g. 64 - 43 = 21, giving 2021.
            year_time = '20' + str(int(year_raw) - 43).zfill(2)
        com_time = year_time + '-' + month_time + '-' + each_found[0]
        if len(clock) == 2:
            com_time = com_time + ' ' + clock[0] + ':' + clock[1]
            return datetime.strptime(com_time, "%Y-%m-%d %H:%M")
        return datetime.strptime(com_time, "%Y-%m-%d")
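The year arithmetic in the function relies on the Thai Buddhist Era (BE) year being 543 years ahead of the Gregorian year, so BE 2564 is 2021. Below is a small sketch of just that step. The helper name `be_year_to_ce` is hypothetical, and reading a two-digit year as 25xx BE is an assumption that matches the `- 43` used in the code above.

```python
def be_year_to_ce(year_text):
    """Convert a Buddhist Era year string (4 or 2 digits) to a Gregorian year."""
    year = int(year_text)
    if len(year_text) == 4:
        return year - 543  # e.g. 2564 -> 2021
    if len(year_text) == 2:
        # Assumption: a two-digit year means 25xx BE, so 64 -> 2564 -> 2021.
        return 2500 + year - 543
    raise ValueError('unexpected year: ' + repr(year_text))


print(be_year_to_ce('2564'))  # 2021
print(be_year_to_ce('64'))    # 2021
```

Working with integers rather than concatenating `'20'` with a string avoids a malformed year when the two-digit value is below 53 (for example BE 2543, which is the year 2000).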
Myanmar time conversion
Burmese dates may even be written with Burmese digits instead of Arabic numerals, so we handle Burmese time conversion separately and replace those digits before processing.
Code display
Here are some code snippets for Myanmar time conversion.
The month substitution is done in a separate function, which is shown at the end.
    import re
    from datetime import datetime


    def miandian_time(original_time, month_time):
        # month_time is the two-digit month from the month replacement function.
        # Replace Burmese digits with Arabic numerals first.
        original_time = (str(original_time)
                         .replace('၁', '1').replace('၂', '2').replace('၃', '3')
                         .replace('၄', '4').replace('၅', '5').replace('၆', '6')
                         .replace('၇', '7').replace('၈', '8').replace('၉', '9')
                         .replace('၀', '0'))
        # The digits are day, year[, hour, minute].
        each_found = re.findall(r'\d+', original_time)
        if len(each_found) >= 4:
            com_time = each_found[1] + '-' + month_time + '-' + each_found[0] + ' ' + each_found[2] + ':' + each_found[3]
            return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
        else:
            com_time = each_found[1] + '-' + month_time + '-' + each_found[0]
            return_time = datetime.strptime(com_time, "%Y-%m-%d")
        return return_time
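The chain of `replace` calls can also be expressed as a single translation table. This is a minimal sketch rather than the crawler's own code; `burmese_to_ascii_digits` is a hypothetical name, and the digits are built from their Unicode code points (U+1040 to U+1049) so that the table cannot be mistyped.

```python
# Burmese digits occupy the code points U+1040 (zero) to U+1049 (nine).
BURMESE_DIGITS = ''.join(chr(0x1040 + i) for i in range(10))
TO_ASCII = str.maketrans(BURMESE_DIGITS, '0123456789')


def burmese_to_ascii_digits(text):
    """Replace every Burmese digit in text with its ASCII counterpart."""
    return text.translate(TO_ASCII)


# Build the Burmese spelling of "9/10" and append an ASCII year.
sample = chr(0x1049) + '/' + chr(0x1041) + chr(0x1040) + '/2021'
print(burmese_to_ascii_digits(sample))  # 9/10/2021
```

`str.translate` walks the string once, whereas ten chained `replace` calls walk it ten times; for short date strings the difference is mostly one of readability.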
Time conversion for other languages whose month names must be replaced
Some countries write the month name in their own language, such as German, Italian, Khmer, Lao and so on.
Code display
Here are some code snippets for other time conversions that need the month to be replaced.
The month substitution is done in a separate function, which is shown at the end.
    import re
    from datetime import datetime


    def other_time(original_time, month_time):
        # month_time is the two-digit month from the month replacement function.
        # The digits are day, year[, hour, minute].
        each_found = re.findall(r'\d+', original_time)
        if len(each_found) >= 4:
            com_time = each_found[1] + '-' + month_time + '-' + each_found[0] + ' ' + each_found[2] + ':' + each_found[3]
            return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
        else:
            com_time = each_found[1] + '-' + month_time + '-' + each_found[0]
            return_time = datetime.strptime(com_time, "%Y-%m-%d")
        return return_time
Interval conversion
In addition to the cases above, we may encounter times expressed relative to the present, such as a few hours ago, a few days ago, yesterday or today, so we need a different solution for them.
Code display
Some interval conversion code snippets are shown below.
    from datetime import datetime, timedelta
    # parse is assumed to be dateutil.parser.parse (it accepts fuzzy=).
    from dateutil.parser import parse


    def ago_time(at):
        """Convert a relative time such as '3 Hours ago' to a datetime at midnight."""

        def midnight(moment):
            # Only the date is kept; the time of day is reset to 00:00.
            return datetime.combine(moment, datetime.min.time())

        def amount(marker):
            # The number is whatever precedes the marker, e.g. '3 ' in '3 Hours ago'.
            return int(at[:at.find(marker)])

        now = datetime.now()
        for marker in ('Minutes ago', 'min'):
            if marker in at:
                return midnight(now - timedelta(minutes=amount(marker)))
        for marker in ('Hours ago', 'hour', 'hr'):
            if marker in at:
                return midnight(now - timedelta(hours=amount(marker)))
        if 'today' in at or 'Today' in at or 'TODAY' in at:
            return midnight(now)
        if 'yesterday' in at or 'Yesterday' in at or 'YESTERDAY' in at:
            return midnight(now - timedelta(days=1))
        for marker in ('Days ago', 'day'):
            if marker in at:
                return midnight(now - timedelta(days=amount(marker)))
        for marker in ('Weeks ago', 'week'):
            if marker in at:
                return midnight(now - timedelta(weeks=amount(marker)))
        for marker in ('Months ago', 'month'):
            if marker in at:
                # A month is approximated as 30 days.
                return midnight(now - timedelta(days=amount(marker) * 30))
        # Not a relative time: fall back to the general parser.
        try:
            return parse(at, fuzzy=True).replace(tzinfo=None)
        except Exception as e_time:
            print(e_time)
            return midnight(now)
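The same idea can be written with one case-insensitive regular expression instead of one branch per spelling. The sketch below is an alternative under stated assumptions, not the crawler's code: `relative_to_date` and `UNIT_DAYS` are hypothetical names, the current time is passed in as `now` so the result can be checked, and a month is approximated as 30 days as in the code above.

```python
import re
from datetime import datetime, timedelta

# Days per unit; a month is approximated as 30 days.
UNIT_DAYS = {'day': 1, 'week': 7, 'month': 30}


def relative_to_date(text, now=None):
    """Return midnight of the day described by text, or None if not relative."""
    now = now or datetime.now()
    lowered = text.lower()
    if 'yesterday' in lowered:
        moment = now - timedelta(days=1)
    elif 'today' in lowered:
        moment = now
    else:
        match = re.search(r'(\d+)\s*(min|hour|hr|day|week|month)', lowered)
        if match is None:
            return None
        amount, unit = int(match.group(1)), match.group(2)
        if unit == 'min':
            moment = now - timedelta(minutes=amount)
        elif unit in ('hour', 'hr'):
            moment = now - timedelta(hours=amount)
        else:
            moment = now - timedelta(days=amount * UNIT_DAYS[unit])
    # Truncate to midnight so only the date is kept.
    return datetime.combine(moment.date(), datetime.min.time())


fixed_now = datetime(2021, 10, 9, 0, 30)
print(relative_to_date('3 Hours ago', fixed_now))  # 2021-10-08 00:00:00
print(relative_to_date('2 weeks ago', fixed_now))  # 2021-09-25 00:00:00
```

Returning `None` for non-relative input leaves the fallback to the caller, for example a general date parser.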
Chinese Tibetan time conversion
Chinese dates can be processed directly. Tibetan dates need some substitutions first.
Code display
Here are some code snippets for Chinese and Tibetan time conversion.
I have encountered only a few Tibetan date formats, so there may well be Tibetan dates that this code cannot handle.
    # Fragment from inside the time-processing function.
    # Needs: import re; from datetime import datetime.

    # Tibetan: replace Tibetan digits and strip the month/day markers.
    if re.findall(r'སྤྱི་ཟླ་*. པའི་ཚེས་*.', str(original_time)):
        original_time = (str(original_time)
                         .replace('༡', '1').replace('༢', '2').replace('༣', '3')
                         .replace('༤', '4').replace('༥', '5').replace('༦', '6')
                         .replace('༧', '7').replace('༨', '8').replace('༩', '9')
                         .replace('༠', '0'))
        original_time = original_time.replace(r'སྤྱི་ཟླ་', '').replace(r' པའི་ཚེས་', '-').replace(r' ', ' ').replace(r' ', '-')
    if re.findall(r'\d{1,4}-\d{1,2}-\d{1,2} ཁུངས།', str(original_time)):
        original_time = ''.join(re.findall(r'\d{1,4}-\d{1,2}-\d{1,2} ཁུངས།', str(original_time)))
    if re.findall(r'སྤེལ་དུས།: \d{1,4}-\d{1,2}-\d{1,2} རྩོམ་པ་པོ།', str(original_time)):
        original_time = ''.join(re.findall(r'སྤེལ་དུས།: \d{1,4}-\d{1,2}-\d{1,2} རྩོམ་པ་པོ།', str(original_time)))

    # Chinese: 年 = year, 月 = month, 日 = day.
    if re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\s\d{1,2}:\d{1,2}', str(original_time)):
        # Date and time separated by whitespace.
        com_time = ''.join(re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\s\d{1,2}:\d{1,2}', str(original_time))).replace('年', '-').replace('月', '-').replace('日', '')
        return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
        return return_time
    elif re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\d{1,2}:\d{1,2}', str(original_time)):
        # Date and time written without a space: insert one in place of 日.
        com_time = ''.join(re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日\d{1,2}:\d{1,2}', str(original_time))).replace('年', '-').replace('月', '-').replace('日', ' ')
        return_time = datetime.strptime(com_time, "%Y-%m-%d %H:%M")
        return return_time
    elif re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日', str(original_time)):
        # Date only.
        com_time = ''.join(re.findall(r'\d{1,4}年\d{1,2}月\d{1,2}日', str(original_time))).replace('年', '-').replace('月', '-').replace('日', '')
        return_time = datetime.strptime(com_time, "%Y-%m-%d")
        return return_time
    elif re.findall(r'\d{1,4} ཟླ་བ་ \d{1,2} ཚེས \d{1,2}', str(original_time)):
        # Tibetan date written with month and day markers between the numbers.
        com_time = ''.join(re.findall(r'\d{1,4} ཟླ་བ་ \d{1,2} ཚེས \d{1,2}', str(original_time))).replace(' ཟླ་བ་ ', '-').replace(' ཚེས ', '-')
        return_time = datetime.strptime(com_time, "%Y-%m-%d")
        return return_time
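The digit replacement step does not have to be written once per script. As a hedged alternative, the sketch below maps any Unicode decimal digit (Tibetan, Burmese and others) to its ASCII form using the standard `unicodedata` module; the function name `normalize_digits` is hypothetical.

```python
import unicodedata


def normalize_digits(text):
    """Replace any Unicode decimal digit (Tibetan, Burmese, ...) with 0-9."""
    return ''.join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


# Tibetan digits are U+0F20 (zero) to U+0F29 (nine); build "2021" from them.
tibetan_2021 = ''.join(chr(0x0F20 + int(d)) for d in '2021')
print(normalize_digits(tibetan_2021 + '-10-9'))  # 2021-10-9
```

Only the digits are handled here; the Tibetan month and day markers still have to be removed separately, as in the code above.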
Month replacement function
Some countries write the month name in their own language, such as German, Italian, Khmer, Lao and so on. The main purpose of this function is to convert the month names in these different languages into a two-digit numeric string that can be parsed.
Code display
Here is the code of the month replacement function.
    # Month number and the names that map to it, checked in this order.
    MONTH_NAMES = [
        ('01', ("JAN", "Jan", "ม.ค.", "มกราคม", "មករា", "ဇန္နဝါရီ၊", "ဇန်နဝါရီ", "ມັງກອນ", "Gennaio")),
        ('02', ("FEB", "Feb", "ก.พ.", "กุมภาพันธ์", "កុម្ភៈ", "ေဖေဖာ္၀ါရီ၊", "ဖေဖော်ဝါရီ", "ກຸມພາ")),
        ('03', ("MAR", "Mar", "มี.ค.", "มีนาคม", "មីនា", "Mär", "Maerz", "မတ္၊", "မတ်", "ມີນາ")),
        ('04', ("APR", "Apr", "เม.ย.", "เมษายน", "មេសា", "ဧၿပီ၊", "ဧပြီ", "ເມສາ")),
        ('05', ("MAY", "May", "พ.ค.", "พฤษภาคม", "ឧសភា", "Mai", "ေမ၊", "မေ", "ພຶດສະພາ", "Maggio")),
        ('06', ("JUN", "Jun", "มิ.ย.", "มิถุนายน", "មិថុនា", "ဇြန္၊", "ဇွန်", "ມິຖຸນາ", "Giugno")),
        ('07', ("JUL", "Jul", "ก.ค.", "กรกฎาคม", "កក្កដា", "ဇူလိုင္၊", "ဇူလိုင်", "ກໍລະກົດ", "Luglio")),
        ('08', ("AUG", "Aug", "ส.ค.", "สิงหาคม", "សីហា", "ၾသဂုတ္၊", "ဩဂုတ်", "ြဂုတ်", "ສິງຫາ", "Agosto")),
        ('09', ("SEP", "Sep", "ก.ย.", "กันยายน", "កញ្", "စက္တင္ဘာ၊", "စက်တင်ဘာ", "ກັນຍາ", "Settembre")),
        ('10', ("OCT", "Oct", "ต.ค.", "ตุลาคม", "តុលា", "Okt", "ေအာက္တိုဘာ၊", "အောက်တိုဘာ", "ຕຸລາ", "Ottobre")),
        ('11', ("NOV", "Nov", "พ.ย.", "พฤศจิกายน", "វិច្ឆិកា", "ႏိုဝင္ဘာ၊", "နိုဝင်ဘာ", "ພະຈິກ")),
        ('12', ("DEC", "Dec", "ธ.ค.", "ธันวาคม", "ធ្នូ", "Dez", "ဒီဇင္ဘာ၊", "ဒီဇင်ဘာ", "ທັນວາ", "Dicembre")),
    ]


    def month_time_set(ot):
        # Return the two-digit month of the first month name found in ot, or ''.
        for month_time, names in MONTH_NAMES:
            if any(name in ot for name in names):
                return month_time
        return ''
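To show how the month replacement combines with the digit extraction used by the conversion functions, here is a compact end-to-end sketch. The three-entry table `SAMPLE_MONTHS` and the function `convert` are hypothetical stand-ins for illustration; the month names in the table are taken from the function above.

```python
import re
from datetime import datetime

# Hypothetical, shortened month table; the names come from the full function.
SAMPLE_MONTHS = [
    ('03', ('Mär', 'Maerz')),
    ('10', ('Okt', 'Ottobre')),
    ('12', ('Dez', 'Dicembre')),
]


def convert(original_time):
    """Convert 'day <month name> year [hour:minute]' to a datetime."""
    month_time = next(
        (num for num, names in SAMPLE_MONTHS
         if any(name in original_time for name in names)),
        '',
    )
    if not month_time:
        raise ValueError('unknown month in ' + repr(original_time))
    # The remaining digits are day, year[, hour, minute].
    each_found = re.findall(r'\d+', original_time)
    com_time = each_found[1] + '-' + month_time + '-' + each_found[0]
    if len(each_found) >= 4:
        com_time += ' ' + each_found[2] + ':' + each_found[3]
        return datetime.strptime(com_time, '%Y-%m-%d %H:%M')
    return datetime.strptime(com_time, '%Y-%m-%d')


print(convert('9. Oktober 2021 14:30'))  # 2021-10-09 14:30:00
print(convert('9 Ottobre 2021'))         # 2021-10-09 00:00:00
```

Because the lookup is a substring test, the abbreviation `Okt` also matches the full German name `Oktober`. The same property means a short name can match inside an unrelated word, so the order of the table matters.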