Data Structures Course Project: Efficiency Analysis of Spatial and Text Queries


Preface

The main ideas and code were all covered in class; this post simply records them.

Task 1: Merchant location point query

Problem

Time limit per test case: 24.0 sec
Memory limit: 2048 MB

With the popularity of smartphones, geographic information is now widely used in apps such as Amap (Gaode Maps), Dianping, and Ele.me. The final assignment of this data structures and algorithms course simulates real-life query needs and implements search tasks based on geographic and text information.

Each merchant in the system has the following three items:
1. Position (x, y), with x > 0 and y > 0;
2. Merchant name, a 12-character string of uppercase letters A-Z (no lowercase);
3. Cuisine, a 6-character string of uppercase letters A-Z (no lowercase);


The query function is as follows: the user inputs the name of a merchant of interest, such as KFC, and the program outputs the location of KFC. The result is guaranteed to be unique when the query succeeds. When the queried name does not exist, the query fails and the output is NULL.

Input format
Line 1: the number of merchants m and the number of queries n; m and n are integers, both no greater than 10^9;
Lines 2 to (m+1): merchant information, consisting of merchant name, location x, location y, and cuisine;
The last n lines: the queries, each line containing one merchant name;


Output format
The output should have n lines. Each line gives the result of the corresponding query, i.e. the merchant's location x, y.

Example

Input
5 3
JEANGEORGES 260036 14362 FRENCH
HAIDILAO 283564 13179 CHAFINGDIS
KFC 84809 46822 FASTFOOD
SAKEMATE 234693 37201 JAPANESE
SUBWAY 78848 96660 FASTFOOD
HAIDILAO
SAKEMATE
SUBWAY







Output
283564 13179
234693 37201
78848 96660

Design

First, the task is divided into storage and lookup.

  1. Storage
    With purely sequential storage, a lookup has to examine many elements. Instead, we use a hash table that resolves collisions in a way similar to separate chaining. Create an array of vectors, use a hash function to compute the index for each restaurant name, and store the restaurant's record in the vector at that index. Each record is thus appended to the bucket for its hash value.

  2. Lookup
    Locate the vector (bucket) that corresponds to the queried restaurant name. The number of elements to traverse is now much smaller, so a sequential search within the bucket is sufficient.

  3. Binary search requires each vector to be sorted, so an ordered insertion routine is also implemented.

  4. Alternatively, use the restaurant name as the key and store and search with a BST.
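
Points 1 to 3 can be sketched compactly with the standard library. This is a minimal illustration and not the code used for the measurements below: the names `Record` and `BucketTable` and the default bucket count are assumptions made for the example, and `std::lower_bound` stands in for the hand-written ordered insertion and binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record type for illustration: name plus location.
struct Record {
    std::string name;
    int x, y;
};

// Hypothetical table: an array of buckets, each kept sorted by name.
class BucketTable {
public:
    explicit BucketTable(std::size_t buckets = 9997) : table_(buckets) {
        assert(buckets > 0);
    }

    // Ordered insertion: find the first record whose name is not less
    // than r.name and insert in front of it.
    void insert(const Record &r) {
        std::vector<Record> &b = table_[index(r.name)];
        b.insert(std::lower_bound(b.begin(), b.end(), r.name, nameLess), r);
    }

    // Binary search inside the bucket; returns nullptr when the name is absent.
    const Record *find(const std::string &name) const {
        const std::vector<Record> &b = table_[index(name)];
        auto it = std::lower_bound(b.begin(), b.end(), name, nameLess);
        return (it != b.end() && it->name == name) ? &*it : nullptr;
    }

private:
    static bool nameLess(const Record &r, const std::string &key) {
        return r.name < key;
    }

    // Polynomial hash reduced modulo the bucket count (assumes uppercase A-Z input).
    std::size_t index(const std::string &s) const {
        unsigned long long h = 0;
        for (char c : s)
            h = (h * 13 + static_cast<unsigned long long>(c - 'A')) % table_.size();
        return static_cast<std::size_t>(h);
    }

    std::vector<std::vector<Record>> table_;
};
```

A caller would insert every merchant once and then call `find` per query, printing `NULL` when it returns `nullptr`. Inserting into a vector is still linear in the bucket size, so this only shortens the search for the insertion position.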

Implementation code

  • Merchant information storage and symbol overloading
#include<bits/stdc++.h>
using namespace std; 
typedef unsigned long long ULL; 
enum Error_code{not_present,overflow,duplicate_error,success,underflow};

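struct info   //Store information 
{
	int lx,ly;
	string name,style;
	info():lx(0),ly(0){}   //Default constructor, required by Binary_node
	explicit info(const string &sname):lx(0),ly(0),name(sname){}   //Search key that carries only a name
	info(string &sname,int &llx,int &lly,string &sstyle);        
	info(const info &p);
};
bool operator > (const info &x, const info &y);
bool operator < (const info &x, const info &y);
bool operator <= (const info &x, const info &y);
inline bool operator == (const info &x, const info &y){return x.name==y.name;}   //Used by the BST search
ostream & operator << (ostream &output, info &a);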

info::info(const info &p)
{
	lx=p.lx;
	ly=p.ly;
	name=p.name;
	style=p.style;
}

info::info(string &sname,int &llx,int &lly,string &sstyle)
{
	lx=llx;
	ly=lly;
	name=sname;
	style=sstyle;
}

bool operator < (const info &x, const info &y)
{
	return x.name < y.name;
}

bool operator > (const info &x, const info &y)
{
	return x.name>y.name;
}

bool operator <= (const info &x, const info &y)
{
	return x.name<=y.name;
}

ostream & operator << (ostream &output, info &a)
{
	output<< a.lx << " " << a.ly;
	return output;
}

hash_(sequential / binary)_search

  • Definition of answer class
const int hash_size=9997;
class Solution
{
	public:
		Solution();
		void just_do_it_sequential_search();  //Sequential search
		void just_do_it_binary_search();  //Binary search
	private:
		string target; //Target to find
		ULL use_probe;  //hash value corresponding to the target
		vector<info> entry[hash_size];  
	protected:
		void sequential_search();
		ULL hash_position(const string &new_entry) const; //Calculate mapping value
		void init(string s_target);    //Get restaurant information to find
		void append(info a);   
		void orded_insert(info a); //Sequential insertion, for binary search
		void recursive_binary_search(int bottom, int top, int &position);
		void recursive_binary_search_2(int bottom, int top, int &position);
};

  • Implementation of search
//Sequential search
void Solution::just_do_it_sequential_search()
{
	//ifstream in("rand_800000_10.in");
	int m,n;
	cin >> m >> n;
	//in >> m >> n;
	for(int i=0;i<m;i++)
	{
		string name,style;
		int lx,ly;
		cin >> name >> lx >> ly >> style;
		//in >> name >> lx >> ly >> style;
		append(info(name,lx,ly,style));
	}
	for(int i=0;i<n;i++)
	{
		string find;
		cin >> find;
		//in >> find;
		init(find);
		sequential_search();
	}
}

void Solution::sequential_search()
{
	int len_entry;
	len_entry=entry[use_probe].size();
	for(int position=0;position<len_entry;position++)
	{
		if(entry[use_probe][position].name==target)
		{
			cout << entry[use_probe][position] << endl;
			return ;
		}
	}
	cout << "NULL" << endl;
	return ;
}
//Binary search (with ordered insertion)
void Solution::orded_insert(info a)
{
	ULL probe=hash_position(a.name);
	if(entry[probe].size()==0)
	{
		entry[probe].push_back(a);
		return ;
	}
	else if(a<entry[probe].front())
	{
		entry[probe].insert(entry[probe].begin(),a);
		return ;
	}
	else if(a>entry[probe].back())
	{
		entry[probe].push_back(a);
	}
	else
	{
		//Linear scan for the first element that is not less than a, then insert in front of it
		vector<info>::iterator it=entry[probe].begin();
		while(it!=entry[probe].end()&&*it<a)
			++it;
		entry[probe].insert(it,a);
		return ;
	}
}

void Solution::just_do_it_binary_search()
{
	//ifstream in("rand_800000_10.in");
	int m,n;
	cin >> m >> n;
	//in >> m >> n;
	for(int i=0;i<m;i++)
	{
		string name,style;
		int lx,ly;
		cin >> name >> lx >> ly >> style;
		//in >> name >> lx >> ly >> style;
		orded_insert(info(name,lx,ly,style));
	}
	for(int i=0;i<n;i++)
	{
		string find;
		cin >> find;
		//in >> find;
		init(find);
		int bottom=0;
		int top=entry[use_probe].size()-1;
		int position=-1;
		recursive_binary_search(bottom,top,position);
		//recursive_binary_search2(bottom,top,position);
	}
}

void Solution::recursive_binary_search(int bottom, int top, int &position)
{ 
	if (bottom<top) 
	{ 
		int mid = (bottom+top)/2;
		if (target>entry[use_probe][mid].name) 
			return recursive_binary_search(mid+1,top,position);
		else 
			return recursive_binary_search(bottom,mid,position);
	}
	else 
	{
		position=bottom; 
		//bottom>top means the bucket is empty, so there is nothing to compare
		if (bottom==top && entry[use_probe][position].name == target) 
		{
			cout << entry[use_probe][position] << endl;
			return ;
		}
		cout << "NULL" << endl;
		return ;
	}
}

void Solution::recursive_binary_search_2(int bottom, int top, int &position)
{ 
	if (bottom<=top) 
	{ 
		int mid = (bottom+top)/2;
		if(target==entry[use_probe][mid].name)
		{
			position=mid; 
			cout << entry[use_probe][position] << endl;
			return ;
		}
		if (target>entry[use_probe][mid].name) 
			return recursive_binary_search_2(mid+1,top,position);
		else 
			return recursive_binary_search_2(bottom,mid-1,position);
	}
	else 
	{
		cout << "NULL" << endl;
		return ;
	}
}
  • Other function implementation
Solution::Solution()
{
	
}

void Solution::init(string s_target)
{
	target=s_target;
	use_probe=hash_position(target);
}

void Solution::append(info a)
{
	ULL probe=hash_position(a.name);
	entry[probe].push_back(a);
}

ULL Solution::hash_position(const string &str) const
{
	ULL seed = 13;
    ULL hash = 0;
	for(int i=0;i<str.length();i++)
    {
        hash = (hash*seed+(str[i]-'A'))%hash_size;
    }
    return hash; 
}

  • main function
int main()
{
	Solution s;
	s.just_do_it_sequential_search();
	return 0;
}

Binary_search_tree

  • Binary tree node
struct Binary_node 
{
	info data;
	Binary_node *left;
	Binary_node *right;
	Binary_node( );
	Binary_node(const info &x);
};

Binary_node :: Binary_node( )
{
	left = NULL;
	right = NULL;
}

Binary_node :: Binary_node(const info &x)
{
	data = x;
	left = NULL;
	right = NULL;
}
  • Binary search tree
class Binary_search_tree
{
	public:
		Binary_search_tree( );
		Error_code insert(const info &new_data);
		Error_code tree_search(info &target) const;
	private:
		Binary_node *search_for_node(Binary_node* sub_root, const info &target) const;
		Error_code search_and_insert(Binary_node * &sub_root, const info &new_data);
		Binary_node *root;
		int count;
};

Binary_search_tree::Binary_search_tree()
{
	root = NULL;
	count=0;
}

Binary_node* Binary_search_tree::search_for_node(
Binary_node * sub_root, const info &target) const
{
	if (sub_root == NULL || sub_root->data == target)
		return sub_root;
	else if (sub_root->data < target)
		return search_for_node(sub_root->right, target);
	else return search_for_node(sub_root->left, target);
}

Error_code Binary_search_tree::tree_search(info &target) const
{
	Error_code result = success;
	Binary_node *found = search_for_node(root,target);
	if (found == NULL)
		result = not_present;
	else
		target = found->data;
	return result;
}

Error_code Binary_search_tree::insert(const info &new_data)
{
	Error_code result=search_and_insert(root, new_data);
	if(result==success)
		count++;
	return result;
}

Error_code Binary_search_tree::search_and_insert(
Binary_node * &sub_root, const info &new_data)
{
	if (sub_root == NULL) 
	{
		sub_root = new Binary_node(new_data);
		return success;
	}
	else if (new_data < sub_root->data)
		return search_and_insert(sub_root->left, new_data);
	else if (new_data > sub_root->data)
		return search_and_insert(sub_root->right, new_data);
	else return duplicate_error;
}
  • main function
int main()
{
	//ifstream in("rand_4000_1.in");
	int m,n;
	cin >> m >> n;
	//in >> m >> n;
	//cout << "m = " << m << " n = " << n << endl;
	Binary_search_tree my_tree;
	for(int i=0;i<m;i++)
	{
		string name,style;
		int lx,ly;
		cin >> name >> lx >> ly >> style;
		//in >> name >> lx >> ly >> style;
		my_tree.insert(info(name,lx,ly,style));
	}
	for(int i=0;i<n;i++)
	{
		string find;
		cin >> find;
		//in >> find;
		info target(find);
		//Search once; on success tree_search copies the stored record into target
		if(my_tree.tree_search(target)!=not_present)
			cout << target << endl;
		else
			cout << "NULL" << endl;
	}
	return 0;
}

Data analysis

The search time of each algorithm was measured at different data scales (200, 40000, 800000) and with different input orders (ascending, descending, random). Because of limited space, not all test data is shown.

Visualization:



  • Sequential search: the search time complexity is O(n); it is the least efficient and takes the longest.

  • Binary search (forgetful version and equality-recognizing version): the search time complexity is O(log(n)). In practice the curves of the two binary search variants almost coincide. Because the algorithm requires a sorted array, insertion uses a linear scan to find the insertion position, with two consequences:
    (a) When the data volume is small (200), the cost of building the sorted structure and the extra programming effort make binary search more expensive than sequential search;
    (b) Insertion is faster when the input data is already in order.

  • Binary search tree: the search time complexity is O(n) in the worst case and O(log(n)) when the tree is balanced. When the input data is ordered, the tree degenerates into a linked list and the search becomes sequential.

  • Hash + sequential search: insertion, access, and lookup are O(1) at best and O(n) at worst. This is by far the fastest variant, with the shortest search time.
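
The degeneration noted for the binary search tree is easy to reproduce. The sketch below is illustrative only: `Node`, `bstInsert`, `bstHeight`, and `heightAfterInserting` are hypothetical helpers, not the `Binary_search_tree` class above. It inserts keys into an unbalanced BST and reports the tree height, which equals the number of keys when they arrive in sorted order.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type for the demonstration.
struct Node {
    std::string key;
    std::unique_ptr<Node> left, right;
    explicit Node(const std::string &k) : key(k) {}
};

// Iterative insertion into an unbalanced BST; duplicate keys are ignored.
void bstInsert(std::unique_ptr<Node> &root, const std::string &key) {
    std::unique_ptr<Node> *cur = &root;
    while (*cur) {
        if (key < (*cur)->key)
            cur = &(*cur)->left;
        else if ((*cur)->key < key)
            cur = &(*cur)->right;
        else
            return;
    }
    cur->reset(new Node(key));
    assert(*cur);
}

// Height = number of nodes on the longest root-to-leaf path.
int bstHeight(const std::unique_ptr<Node> &root) {
    if (!root) return 0;
    return 1 + std::max(bstHeight(root->left), bstHeight(root->right));
}

// Builds a tree from the keys in the given order and returns its height.
int heightAfterInserting(const std::vector<std::string> &keys) {
    std::unique_ptr<Node> root;
    for (const std::string &k : keys) bstInsert(root, k);
    return bstHeight(root);
}
```

Calling `heightAfterInserting` on five keys in ascending order gives a height of 5, whereas a well-mixed insertion order of the same keys can give 3. A self-balancing tree would remove this dependence on input order, at the cost of more complex insertion.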

Task 2: Top-k merchant query

Problem

Time limit per test case: 29.0 sec
Memory limit: 2048 MB

With the popularity of smartphones, geographic information is now widely used in apps such as Amap (Gaode Maps), Dianping, and Ele.me. The final assignment of this data structures course simulates real-life query needs and implements search tasks based on geographic and text information.

Each merchant in the system has the following three items of information:
1. Position (x, y), with x > 0 and y > 0;
2. Merchant name, a 12-character string of uppercase letters A-Z (no lowercase);
3. Cuisine, a 6-character string of uppercase letters A-Z (no lowercase);


Your program needs to provide the following query function: the user inputs their own location <ux, uy>, the cuisine of interest, and an integer k. The program outputs the names and distances of the k nearest merchants of that cuisine, from near to far. When distances are equal, merchants are ordered by name in dictionary order. The distance is rounded to 3 decimal places. If fewer than k merchants qualify, all qualifying merchants are output. If no merchant qualifies for a query, output a blank line.

Input format
Line 1: the number of merchants m and the number of queries n; m and n are integers, both no greater than 10^9;
Lines 2 to (m+1): merchant information, consisting of merchant name, location x, location y, and cuisine;
The last n lines: each line is one query, consisting of the user's location ux and uy, the cuisine name, and the value k;


Output format
Corresponding to each query, output the qualified merchant information in sequence, and each line corresponds to a merchant.

Example

Input
5 2
MCDONALD 260036 14362 FASTFOOD
HAIDILAO 283564 13179 CHAFINGDIS
KFC 84809 46822 FASTFOOD
DONGLAISHUN 234693 37201 CHAFINGDIS
SUBWAY 78848 96660 FASTFOOD
28708 23547 FASTFOOD 2
18336 14341 CHAFINGDIS 3






Output
KFC 60737.532
SUBWAY 88653.992
DONGLAISHUN 217561.327
HAIDILAO 265230.545


Design

  • Because the input data is public, examine the data scale first

    The first line is the data scale N, the second line is the number of restaurants per cuisine T, and the third line is the k value to be queried.

  • Consider sorting all data
    The first three textbook algorithms, which have O(n^2) time complexity, are not considered; only those with O(n log n) time complexity are. Even so, it is hard to finish within the time limit at the maximum data size N=800000. To reduce the number of elements to traverse, a hash_vector like the one in Task 1 is used, with the cuisine as the key for storing the restaurant data. I use Shell sort, quicksort, and heap sort, implemented as in the textbook.

  • Quick_select
    A decrease-and-conquer method that is broadly similar to quick_sort, except that only one side of the partition needs to be processed. The idea is:
    (1) Use a vector to store all data.
    (2) Choose the pivot: select the last element of the queried cuisine in the array.
    (3) The partition process is similar to quick_sort: put elements smaller than the pivot on its left and larger ones on its right.
    (4) Let L be the length of the left part, including the pivot. If L equals k, the k smallest elements have been found. If L > k, discard the right part and partition the left part again. If L < k, continue partitioning on the right part to find its k - L smallest elements.
    (5) When the search ends, the leftmost k elements of the array are the k smallest, but they are unordered. Heap sort is then called, and the result is output after sorting.
    (6) Optimization 1: using a hash_vector like the one in Task 1, with the cuisine as the key, effectively reduces the number of elements traversed during partition and brings the time complexity down to O(T). Experiments show that a good hash function gives a high hit rate.
    (7) Optimization 2: the pivot selection strategy affects the efficiency of the algorithm. A literature search turns up the BFPRT algorithm: split the data into groups, take the median of each group, and then use the median of those medians as the pivot.
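
The select-then-sort idea in steps (4) and (5) can be sketched with the standard library. This is an assumption-laden illustration rather than the implementation measured later: `Candidate`, `closer`, and `topK` are hypothetical names, distances are taken as already computed, and `std::nth_element` replaces the hand-written partition loop (its internal pivot strategy is up to the library).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type: one merchant of the queried cuisine.
struct Candidate {
    std::string name;
    double dis;   // distance to the user, already computed
};

// Order by distance, then by name, as the task requires.
bool closer(const Candidate &a, const Candidate &b) {
    if (a.dis != b.dis) return a.dis < b.dis;
    return a.name < b.name;
}

// Returns the k nearest candidates in ascending order.
std::vector<Candidate> topK(std::vector<Candidate> v, std::size_t k) {
    if (k > v.size()) k = v.size();
    if (k < v.size()) {
        // Selection step: afterwards v[0..k) holds the k smallest, unordered.
        std::nth_element(v.begin(), v.begin() + k, v.end(), closer);
    }
    v.erase(v.begin() + k, v.end());
    assert(v.size() == k);
    // Only the k survivors are sorted, which costs O(k log k).
    std::sort(v.begin(), v.end(), closer);
    return v;
}
```

Selecting first and sorting only the survivors keeps the expensive ordering step proportional to k rather than to the number of merchants of that cuisine.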







  • BFPRT
    In the quick selection above, I chose the last restaurant of the queried cuisine as the pivot. The efficiency of the algorithm depends heavily on the pivot choice. BFPRT chooses the median of the group medians (groups of five) as the pivot each time, which makes the partition more balanced and avoids the O(n^2) worst case. The steps are as follows:
    (1) Divide the n elements of the input array into n/5 groups of 5 elements each; at most one further group holds the remaining n % 5 elements.
    (2) Find the median of each group: insertion-sort the elements of the group, then take the middle element of the sorted sequence.
    (3) For the n/5 medians found in (2), recursively apply steps (1) and (2) until only one number is left, which is the median of the n/5 medians. Record its index p.
    (4) In the partition step, use the element at index p as the pivot.
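
The four steps can be sketched on plain numbers. This is a simplified illustration under stated assumptions, not the listing used for the measurements: it works on `double` values instead of merchant records, the function names are hypothetical, and `std::sort` on each group of at most five elements stands in for insertion sort.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

int selectIndex(std::vector<double> &a, int lo, int hi, int k);

// Step (2): sort one small group a[lo..hi] and return the index of its median.
static int medianOfGroup(std::vector<double> &a, int lo, int hi) {
    std::sort(a.begin() + lo, a.begin() + hi + 1);
    return lo + (hi - lo) / 2;
}

// Steps (1)-(3): collect the group medians at the front, then select their median.
static int pivotIndex(std::vector<double> &a, int lo, int hi) {
    if (hi - lo < 5) return medianOfGroup(a, lo, hi);
    int store = lo;
    for (int i = lo; i <= hi; i += 5) {
        int m = medianOfGroup(a, i, std::min(i + 4, hi));
        std::swap(a[store++], a[m]);
    }
    int count = store - lo;
    return selectIndex(a, lo, store - 1, (count + 1) / 2);
}

// Step (4): partition a[lo..hi] around a[p]; returns the pivot's final index.
static int partitionAround(std::vector<double> &a, int lo, int hi, int p) {
    std::swap(a[p], a[hi]);
    int store = lo;
    for (int i = lo; i < hi; ++i)
        if (a[i] < a[hi]) std::swap(a[store++], a[i]);
    std::swap(a[store], a[hi]);
    return store;
}

// Returns the index holding the k-th smallest element (k is 1-based) of a[lo..hi].
int selectIndex(std::vector<double> &a, int lo, int hi, int k) {
    assert(lo <= hi && k >= 1 && k <= hi - lo + 1);
    while (true) {
        if (lo == hi) return lo;
        int p = partitionAround(a, lo, hi, pivotIndex(a, lo, hi));
        int rank = p - lo + 1;          // pivot's rank within a[lo..hi]
        if (rank == k) return p;
        if (rank > k) {
            hi = p - 1;                 // answer lies left of the pivot
        } else {
            k -= rank;                  // skip the pivot and everything left of it
            lo = p + 1;
        }
    }
}

// Convenience wrapper working on a copy of the input.
double kthSmallest(std::vector<double> a, int k) {
    return a[selectIndex(a, 0, static_cast<int>(a.size()) - 1, k)];
}
```

After `selectIndex` returns index p for rank k, the elements at positions up to p are the k smallest of the range, in no particular order, so they still need sorting before output.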




  • Priority queue
    The data only needs to be traversed once, which is fast and efficient. The k smallest values are obtained by repeatedly replacing the top of a max-heap. The idea is:
    (1) Use a vector to store all data.
    (2) Scanning from front to back, take the first k restaurants of the queried cuisine from the vector and build a max-heap.
    (3) Continue scanning. If an element has the same cuisine and its distance is smaller than that of the heap top, replace the heap top with it and restore the max-heap.
    (4) Finally, run heap sort and output.
    (5) Optimization: using a hash_vector like the one in Task 1, with the cuisine as the key for storing the restaurant data, effectively reduces the number of elements that need to be traversed. Experiments show that a good hash function gives a high hit rate.
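
The bounded max-heap idea can be sketched with `std::priority_queue`. This is a minimal sketch under assumptions: `Entry` and `nearestK` are hypothetical names, the input is taken to contain only merchants of the queried cuisine with distances already computed, and the library heap replaces the hand-written `build_heap` and `insert_heap`.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// (distance, name) pairs compare lexicographically, which matches the
// required order: distance first, then dictionary order of the name.
using Entry = std::pair<double, std::string>;

// Returns the k smallest entries in ascending order.
std::vector<Entry> nearestK(const std::vector<Entry> &all, std::size_t k) {
    std::priority_queue<Entry> heap;   // max-heap: top() is the worst entry kept so far
    for (const Entry &e : all) {
        if (heap.size() < k) {
            heap.push(e);              // still filling the first k slots
        } else if (k > 0 && e < heap.top()) {
            heap.pop();                // drop the current worst
            heap.push(e);
        }
    }
    assert(heap.size() <= k);
    std::vector<Entry> result(heap.size());
    // Entries pop largest first, so fill the result from the back.
    for (std::size_t i = result.size(); i-- > 0;) {
        result[i] = heap.top();
        heap.pop();
    }
    return result;
}
```

Each replacement costs O(log k), so one pass over T candidates takes O(T log k) in the worst case, and popping the heap at the end already yields the sorted output.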





Implementation code

  • Merchant information storage and symbol overloading
#include<bits/stdc++.h>
using namespace std; 
const double eps = 1e-4;
typedef unsigned long long ULL; 

struct info   //Store information 
{
	double lx,ly,dis;
	string name,style;
	info(string &sname,double &llx,double &lly,string &sstyle);        
	info(const info &p);
};
bool operator > (const info &x, const info &y);
bool operator < (const info &x, const info &y);
ostream & operator << (ostream &output, info &a);

info::info(const info &p)
{
	lx=p.lx;
	ly=p.ly;
	name=p.name;
	style=p.style;
	dis=p.dis;
}

info::info(string &sname,double &llx,double &lly,string &sstyle)
{
	lx=llx;
	ly=lly;
	name=sname;
	style=sstyle;
	dis=0.0;
}

bool operator < (const info &x, const info &y)
{
	if(x.dis==y.dis)
		return x.name<y.name;
	else
		return x.dis < y.dis;
}

bool operator > (const info &x, const info &y)
{
	if(x.dis==y.dis)
		return x.name>y.name;
	else
		return x.dis > y.dis;
}

ostream & operator << (ostream &output, info &a)
{
	output<< a.name << " " << setiosflags(ios::fixed|ios::showpoint)
			<< setprecision(3) << a.dis;
	return output;
}

Textbook sorting algorithms

  • Definition of answer class
const int hash_size=9997; 
class Solution
{
	public:
		Solution();
		void init(double num_x,double num_y,int num_k,string s_target);   //Update data before each round of query 
		void just_do_it();
		void append(info a);  //Add information 
	private:
		double use_x,use_y;
		int use_k;
		ULL use_probe;
		string target;
		vector<info> entry[hash_size];
		vector<info> out;
	protected:
		void search_print();   //Print results 
		void distance(int position);   //Calculate distance 
		ULL hash_position(const string &new_entry) const;
		void heap_sort();  //83
		void insert_heap(const info &current,int low,int high);
		void build_heap();
		void sort_interval(int start,int increment);
		void shell_sort(); //83
		int partition(int low,int high);
		void recursive_quick_sort(int low,int high);
		void quick_sort();//83
};

shell_sort

void Solution::shell_sort()
{
	int increment,start;
	increment=out.size();
	do{
		increment=increment/3+1;
		for(start=0;start<increment;start++)
			sort_interval(start,increment);
	}while(increment>1);
}

void Solution::sort_interval(int start,int increment)
{
	int first_unsorted;
	int place;
	for (first_unsorted=start+increment;first_unsorted<out.size();first_unsorted+=increment) 
		if(out[first_unsorted]<out[first_unsorted-increment])
		{
			place=first_unsorted;
			info current(out[first_unsorted]);
			do{
				out[place]=out[place-increment];
				place-=increment;
			}while(place>start&&out[place-increment]>current);
			out[place]=current;
		}
}

heap_sort

void Solution::heap_sort()
{
	int last_unsorted;
	build_heap();
	for(last_unsorted=out.size()-1;last_unsorted>0;last_unsorted--)
	{
		info current(out[last_unsorted]);
		out[last_unsorted]=out[0];
		insert_heap(current,0,last_unsorted-1);
	}
}

void Solution::insert_heap(const info &current,int low,int high)
{
	int large;
	large=2*low+1;
	while(large<=high)
	{
		if(large<high&&out[large]<out[large+1])
			large++;
		if(current<out[large])
		{
			out[low]=out[large];
			low=large;
			large=2*low+1;
		}
		else
			break;
	}
	out[low]=current;
}

void Solution::build_heap()
{
	int low;
	for(low=out.size()/2-1;low>=0;low--)
	{
		info current(out[low]);
		insert_heap(current,low,out.size()-1);  
	}
}

quick_sort

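void Solution::quick_sort()
{
	//Entry point declared in the class: sort the whole of out
	recursive_quick_sort(0,(int)out.size()-1);
}

void Solution::recursive_quick_sort(int low,int high)
{
	int pivot_position;
	if(low<high)
	{
		pivot_position=partition(low,high);
		recursive_quick_sort(low,pivot_position-1);
		recursive_quick_sort(pivot_position+1,high);
	}
}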

int Solution::partition(int low,int high)
{
	int last_small;
	swap(out[low],out[high]);
	info pivot(out[low]);
	last_small=low;
	for(int i=low+1;i<=high;i++)
	{
		if(out[i]<pivot)
		{
			last_small=last_small+1;
			swap(out[last_small],out[i]);
		}
	}
	swap(out[low],out[last_small]);
	return last_small;
}
  • Other function implementation
Solution::Solution()
{
	
}

void Solution::append(info a)
{
	ULL probe=hash_position(a.style);
	entry[probe].push_back(a);
}

void Solution::distance(int position)
{
	double ans;
	ans=sqrt(fabs(entry[use_probe][position].lx-use_x)*fabs(entry[use_probe][position].lx-use_x)+
		fabs(entry[use_probe][position].ly-use_y)*fabs(entry[use_probe][position].ly-use_y));
	entry[use_probe][position].dis=round(ans*1000)/1000;
}

void Solution::init(double num_x,double num_y,int num_k,string s_target)
{
	use_x=num_x;
	use_y=num_y;
	use_k=num_k;
	target=s_target;
	use_probe=hash_position(target);
	out.clear();
}

void Solution::just_do_it()
{
	for(int i=0;i<entry[use_probe].size();i++)
		if(entry[use_probe][i].style==target)
		{
			distance(i);
			out.push_back(entry[use_probe][i]);
		}
	int len=out.size();
	if(use_k>len)
		use_k=len;
	search_print();
}

ULL Solution::hash_position(const string &str) const
{
	unsigned int seed = 1313;
    ULL hash = 0;
    
	for(int i=0;i<str.length();i++)
    {
        hash = (hash*seed+(str[i]-'A'+1))%hash_size;
    }
    return hash;
}

void Solution::search_print()
{
	quick_sort();//heap_sort(),shell_sort() 
	int len_out=out.size();
	for(int i=0;i<len_out&&i<use_k;i++)
	{
		cout << out[i] << endl;
	}
}

  • main function
int main(int argc, char** argv)
{
	//ifstream in("800000_rand_query15times_2.in");
	int n,m;
	cin >> n >> m;
	//in >> n >> m;
	Solution s;
	for(int i=0;i<n;i++)
	{
		double lx,ly;
		string name,style;
		cin >> name >> lx >> ly >> style;
		//in >> name >> lx >> ly >> style;
		s.append(info(name,lx,ly,style));
	}
	for(int i=0;i<m;i++)
	{
		/*LARGE_INTEGER begin;
		LARGE_INTEGER end;
		LARGE_INTEGER frequ;
		QueryPerformanceFrequency(&frequ);
		QueryPerformanceCounter(&begin); //Get the start time*/
		double lx,ly;
		int lk;
		string ltarget;
		cin >> lx >> ly >> ltarget >> lk;
		//in >> lx >> ly >> ltarget >> lk;
		s.init(lx,ly,lk,ltarget);
		s.just_do_it();
		/*cout << "k = " << lk << endl; 
		QueryPerformanceCounter(&end);//Get the end time
		cout<<fixed<<"time :"<<(end.QuadPart - begin.QuadPart)/(double)frequ.QuadPart<<endl;
		QueryPerformanceFrequency(&frequ);
		cout << endl;
		cin.get();*/
	}
    return 0;
}

Quick selection

  • Definition of answer class
const int hash_size=99997; 
class Solution
{
	public:
		Solution();
		void init(double num_x,double num_y,int num_k,string s_target);   //Update data before each round of query 
		void append(info a);  //Add information 
		int partition(int low,int high);
		void topk(int low,int high,int k);
		void just_do_it();
		void search_print();   //Print results 
		void distance(int position);   //Calculate distance 
		ULL hash_position(const string &new_entry) const;
		void heap_sort();
		void insert_heap(const info &current,int low,int high);
		void build_heap();
	private:
		double use_x,use_y;
		int use_k;
		ULL use_probe;
		string target;
		vector<info> entry[hash_size];
		vector<info> res;
		vector<info> out;
};
  • Implementation of quick selection
void Solution::topk(int low,int high,int k)
{
	if(low<high)
	{
		int pos=partition(low,high);
		if(pos==-1)
			return ;
		int len=pos-low+1;
		if(len==k)
			return ;
		else if(len<k)
			topk(pos+1,high,k-len);
		else
			topk(low,pos-1,k);
	}
	else
		return ;
}

int Solution::partition(int low,int high)
{
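	int pivot_position=high;
	//Scan back to the last element of the target cuisine, without running past low
	while(pivot_position>=low&&res[pivot_position].style!=target)
	    pivot_position--;
	if(pivot_position<low)
		return -1;   //No element of the target cuisine in [low,high]; topk stops on -1
	distance(pivot_position);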
	info pivot(res[pivot_position]);
	int last_small;
	last_small=low;
	for(int i=low;i<pivot_position;i++)
	{
		if(res[i].style==target)
		{
			distance(i);
			if(pivot>res[i])
			{
				swap(res[last_small],res[i]);
				last_small++;
			}
		}
	}
	swap(res[pivot_position],res[last_small]);
	return last_small;
}
  • Implementation of heap sort (other sort can be selected)
void Solution::heap_sort()
{
	int last_unsorted;
	build_heap();
	for(last_unsorted=out.size()-1;last_unsorted>0;last_unsorted--)
	{
		info current(out[last_unsorted]);
		out[last_unsorted]=out[0];
		insert_heap(current,0,last_unsorted-1);
	}
}

void Solution::insert_heap(const info &current,int low,int high)
{
	int large;
	large=2*low+1;
	while(large<=high)
	{
		if(large<high&&out[large]<out[large+1])
			large++;
		if(current<out[large])
		{
			out[low]=out[large];
			low=large;
			large=2*low+1;
		}
		else
			break;
	}
	out[low]=current;
}

void Solution::build_heap()
{
	int low;
	for(low=out.size()/2-1;low>=0;low--)
	{
		info current(out[low]);
		insert_heap(current,low,out.size()-1);  
	}
}
  • Other function implementation
Solution::Solution()
{
	
}

void Solution::append(info a)
{
	ULL probe=hash_position(a.style);
	entry[probe].push_back(a);
}

void Solution::distance(int position)
{
	double ans;
	ans=sqrt(fabs(res[position].lx-use_x)*fabs(res[position].lx-use_x)+
		fabs(res[position].ly-use_y)*fabs(res[position].ly-use_y));
	res[position].dis=round(ans*1000)/1000;
}

void Solution::init(double num_x,double num_y,int num_k,string s_target)
{
	use_x=num_x;
	use_y=num_y;
	use_k=num_k;
	target=s_target;
	use_probe=hash_position(target);
	out.clear();
	res.clear();
}

void Solution::just_do_it()
{
	res.assign(entry[use_probe].begin(),entry[use_probe].end());
	int len=res.size();
	if(use_k>len)
		use_k=len;
	topk(0,len-1,use_k);
	search_print();
}
  • The main function is the same as above

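BFPRT

void Solution::just_do_it()
{
	for(int i=0;i<entry[use_probe].size();i++)
		if(entry[use_probe][i].style==target)
		{
			res.push_back(entry[use_probe][i]);
			distance(res.size()-1);   //Compute the distance once, before any comparison uses it
		}
	int len=res.size();
	if(use_k>len)
		use_k=len;
	if(use_k>0)   //Nothing to select when no merchant matches
		BFPRT(0,len-1,use_k);
	search_print();
}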

int Solution::Insert_Sort(int left, int right) //Insert sort
{
    int j;
    for (int i = left + 1; i <= right; i++)
    {
    	distance(i);
        info temp(res[i]);
        j = i - 1;
        while (j >= left && res[j] > temp)
        {
        	distance(j);
        	distance(j+1);
            res[j + 1] = res[j];
            j--;
        }
        res[j + 1] = temp;
    }

    return ((right - left) >> 1) + left;
}

int Solution::Get_Pivot_Index(int left, int right)  //Get pivot point
{
    if (right - left < 5)
        return Insert_Sort(left, right);
    int sub_right = left - 1;
    for (int i = left; i + 4 <= right; i += 5)
    {
        int index = Insert_Sort(i, i + 4);
        swap(res[++sub_right], res[index]);
    }

    return BFPRT(left, sub_right, ((sub_right - left + 1) >> 1) + 1);
}

int Solution::Partition(int left, int right, int pivot_index)  
{
    swap(res[pivot_index], res[right]); // Place the principal element at the end
	distance(right);
    int partition_index = left; // Tracking the dividing line
    for (int i = left; i < right; i++)
    {
    	distance(i);
        if (res[i] < res[right])
        {
            swap(res[partition_index++], res[i]); // The smaller ones are on the left
        }
    }

    swap(res[partition_index], res[right]); // Finally, change the principal element back

    return partition_index;
}

int Solution::BFPRT(int left, int right, int k)  //High low partition lookup
{
    int pivot_index = Get_Pivot_Index(left, right); // Get the median subscript of the median (i.e. the primary subscript)
    int partition_index = Partition(left, right, pivot_index); // Go back to dividing boundary
    int num = partition_index - left + 1;

    if (num == k)
        return partition_index;
    else if (num > k)
        return BFPRT(left, partition_index - 1, k);
    else
        return BFPRT(partition_index + 1, right, k - num);
}

Priority queue

  • Definition of answer class
const int hash_size=9997; 
class Solution
{
	public:
		Solution();
		void init(double num_x,double num_y,int num_k,string s_target);   //Update data before each round of query 
		void append(info a);  //Add information 
		void insert_heap(const info &current,int low,int high);   //Insert elements into the heap 
		void build_heap();  //Build the heap 
		void heap_sort();   //Heap sort 
		void just_do_it();  //Process, select data 
		void search_print();   //Print results 
		void distance(int position);   //Calculate distance 
		unsigned int hash_position(const string &new_entry) const;
	private:
		double use_x,use_y;
		int use_k;
		unsigned int use_probe;
		string target;
		vector<info> entry[hash_size];
		vector<info> res;
};
  • The implementation of selection and sorting
void Solution::just_do_it()
{
	int len_entry=entry[use_probe].size();
	int cnt=0,flag;
	for(flag=0;flag<len_entry&&cnt<use_k;flag++)
	{
		if(entry[use_probe][flag].style==target)
		{
			distance(flag);
			res.push_back(entry[use_probe][flag]);
			cnt++;
		}
	}
	use_k=res.size();
	build_heap();
	for(int i=flag;i<len_entry;i++)  //Choose one smaller than the top 
	{
		if(entry[use_probe][i].style==target)
		{
			distance(i);
			if(use_k>0&&res[0]>entry[use_probe][i])
			{
				//Replace the heap top with the closer element and sift it down
				insert_heap(entry[use_probe][i],0,use_k-1);
			}
		}
	}
	heap_sort();
	search_print();
}

void Solution::build_heap()
{
	int low;
	for(low=use_k/2-1;low>=0;low--)
	{
		info current(res[low]);
		insert_heap(current,low,use_k-1);
	}
}

void Solution::insert_heap(const info &current,int low,int high)
{
	int large;
	large=2*low+1;
	while(large<=high)
	{
		if(large<high&&res[large]<res[large+1])
			large++;
		if(current<res[large])
		{
			res[low]=res[large];
			low=large;
			large=2*low+1;
		}
		else
			break;
	}
	res[low]=current;
}

void Solution::heap_sort()
{
	int last_unsorted;
	for(last_unsorted=use_k-1;last_unsorted>0;last_unsorted--)
	{
		info current(res[last_unsorted]);
		res[last_unsorted]=res[0];
		insert_heap(current,0,last_unsorted-1);
	}
}
  • The other functions and the main function are the same as above.

Notes

  • When calculating the distance, round the result to three decimal places
void Solution::distance(int position)
{
	double ans;
	ans=sqrt(fabs(entry[use_probe][position].lx-use_x)*fabs(entry[use_probe][position].lx-use_x)+
		fabs(entry[use_probe][position].ly-use_y)*fabs(entry[use_probe][position].ly-use_y));
	entry[use_probe][position].dis=round(ans*1000)/1000;
}
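
The same rounding step can be sketched as a standalone function. This is only an illustration: `rounded_distance` is a hypothetical helper that is not part of the Solution class, and it uses `std::hypot` in place of the explicit square root above.

```cpp
#include <cmath>

// Hypothetical helper (not from the original code): Euclidean distance
// between (x1, y1) and (x2, y2), rounded to three decimal places.
double rounded_distance(double x1, double y1, double x2, double y2)
{
    double d = std::hypot(x1 - x2, y1 - y2);   // sqrt(dx*dx + dy*dy)
    return std::round(d * 1000.0) / 1000.0;    // keep three decimals
}
```

Squaring already removes the sign, so the `fabs` calls in the version above are harmless but not required.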

Data analysis

The search time of each algorithm was measured under different data scales, different cuisine scales and different values of k. Due to limited space, not all test data is shown. (Algorithms with the _hash suffix use hash optimization.)

Visualization:

  1. With the data scale N and the cuisine scale T held fixed (N=800000, T=80000), we look at how different values of k affect the search time.
  • Quick_select: only the subarray that contains the target is considered at each step. The worst-case time complexity is O(n^2) and the average is O(n). The search time depends on the choice of pivot rather than on k, so the line is relatively flat.
    (a) Storing the data in hash buckets reduces the number of elements n to traverse and significantly improves efficiency.
    (b) BFPRT keeps the pivot closer to the median, but computing the median of medians first requires filtering the merchants of the target cuisine out of the original data. With a large amount of data, the result is very poor.

  • Max_heap: the time complexity is O(n*lg(k)), and the slope of the curve changes with k accordingly. Storing the data in hash buckets reduces the number of elements n to traverse and significantly improves efficiency.
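
The O(n*lg(k)) bound of the heap approach can be illustrated with a minimal sketch. It assumes plain `double` distances instead of the `info` records and uses `std::priority_queue` rather than the hand-written heap, so `k_smallest` is a hypothetical name and not the code from this post.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// Return the k smallest distances in ascending order.
// Each of the n elements costs at most O(lg(k)) heap work.
std::vector<double> k_smallest(const std::vector<double>& dist, std::size_t k)
{
    std::priority_queue<double> heap;   // max-heap: farthest kept value on top
    for (double d : dist) {
        if (heap.size() < k) {
            heap.push(d);               // still filling the first k slots
        } else if (k > 0 && d < heap.top()) {
            heap.pop();                 // drop the current farthest
            heap.push(d);               // keep the closer candidate
        }
    }
    std::vector<double> out;
    while (!heap.empty()) {             // pops come out largest first
        out.push_back(heap.top());
        heap.pop();
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

If fewer than k candidates exist, the function simply returns all of them in sorted order.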

  2. With the data scale N and k held fixed (N=800000, k=15), we look at how the cuisine scale T affects the search time.
  • Using hash to store data: the hash function used rarely maps two different cuisines to the same vector, so the number of elements the algorithm has to process is almost the same as the cuisine scale T, rising to about 2*T in the few cases of collision. The collision rate depends on the choice of hash_size and of the hash function.
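
The bucketing idea can be sketched as follows. The body of `hash_position` is not shown in this section, so the polynomial hash below is an assumption made purely for illustration; `bucket_of` and `bucket_load` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const unsigned int kHashSize = 9997;   // same value as hash_size in the post

// Assumed polynomial hash; the post's own hash_position may differ.
unsigned int bucket_of(const std::string& cuisine)
{
    unsigned int h = 0;
    for (unsigned char c : cuisine)
        h = (h * 31u + c) % kHashSize;
    return h;
}

// Number of stored records a query for `target` has to scan: every
// record whose cuisine falls into the same bucket as the target.
std::size_t bucket_load(const std::vector<std::string>& cuisines,
                        const std::string& target)
{
    std::size_t count = 0;
    const unsigned int b = bucket_of(target);
    for (const std::string& c : cuisines)
        if (bucket_of(c) == b)
            ++count;
    return count;
}
```

With no collisions, `bucket_load` equals the cuisine scale T, which is why the query only touches about T elements instead of all N.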

  • Quick_select: as the cuisine scale grows, the number of swaps and comparisons grows sharply as well, at a rate similar to that of the cuisine scale T. Quick_select_hash_bfprt filters the merchants of the target cuisine with a linear search and uses insertion sort to get the medians, so the optimization does not pay off on large-scale data, and its growth rate is similar to that of K. This is why these two algorithms cannot pass the last two test points on EOJ. The growth rate of Quick_select_hash is also similar to that of K, but because the hash rarely collides, the number of elements to traverse is much smaller than N, so it just barely passes on EOJ.

  3. Because the cuisine scale T differs greatly between data scales N, it is hard to treat N as the only variable, so this case is not easy to analyze. The figure below is only a reference for the trend of the running time, not a basis for detailed analysis.

  • In practice, Max_heap is more efficient than Quick_select. Presumably this is because Quick_select filters, compares and moves a large number of elements and then still has to sort the result. These steps take a lot of time, so its actual efficiency is not as good as that of Max_heap.
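
The "select, then sort" pipeline mentioned above can be sketched like this. It is a stand-in rather than the post's Quick_select: `std::nth_element` from the standard library does the selection, and `k_smallest_select` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the k smallest distances in ascending order by selecting first
// and sorting afterwards. The vector is taken by value on purpose, since
// selection reorders its input.
std::vector<double> k_smallest_select(std::vector<double> dist, std::size_t k)
{
    k = std::min(k, dist.size());
    // After this call the first k elements are the k smallest, unordered.
    std::nth_element(dist.begin(), dist.begin() + k, dist.end());
    dist.resize(k);
    std::sort(dist.begin(), dist.end());   // extra O(k*lg(k)) pass
    return dist;
}
```

The copy, the reordering of all n elements and the final sort are the extra steps that the heap version avoids.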
  4. Characteristics of using the BFPRT algorithm to obtain the pivot on small-scale data

  • The trend of the lines in the figure shows that Quick_select with an arbitrary pivot is not stable: its search efficiency depends to a large extent on the pivot chosen, so the curve fluctuates strongly. The pivot found by the BFPRT algorithm is closer to the median of the array, so at the same scale the search is more stable and the curve fluctuates less.
  • Because BFPRT finds a pivot closer to the median, each partition is closer to the optimal even split. With small data sizes, filtering and selecting the medians takes little time, so the optimization is more noticeable and the search time is lower.
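
The median-of-medians pivot can be sketched on plain integers as below. This is a simplified, out-of-place illustration, not the in-place BFPRT from this post: `bfprt_select` is a hypothetical name, and `std::sort` on each group of five replaces the insertion sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the k-th smallest value (k is 1-based) using median-of-medians
// pivots. Precondition: 1 <= k <= v.size().
int bfprt_select(std::vector<int> v, std::size_t k)
{
    if (v.size() <= 5) {
        std::sort(v.begin(), v.end());
        return v[k - 1];
    }
    // Median of each group of five (the last group may be shorter).
    std::vector<int> medians;
    for (std::size_t i = 0; i < v.size(); i += 5) {
        std::size_t end = std::min(i + 5, v.size());
        std::sort(v.begin() + i, v.begin() + end);
        medians.push_back(v[i + (end - i - 1) / 2]);
    }
    // The pivot is the median of those medians.
    int pivot = bfprt_select(medians, (medians.size() + 1) / 2);

    // Three-way split around the pivot value.
    std::vector<int> lower, upper;
    std::size_t equal = 0;
    for (int x : v) {
        if (x < pivot) lower.push_back(x);
        else if (x > pivot) upper.push_back(x);
        else ++equal;
    }
    if (k <= lower.size()) return bfprt_select(lower, k);
    if (k <= lower.size() + equal) return pivot;
    return bfprt_select(upper, k - lower.size() - equal);
}
```

Because the pivot is guaranteed to sit near the middle, neither side of the split can be much larger than the other, which is what keeps the running time steady from query to query.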

Summary

  • I'm still a beginner. The big assignment took a week to finish, but getting AC in the end felt great. My biggest takeaway is that as long as you keep submitting, you will eventually get AC.
  • What I enjoy most about the data structures class is being able to use what I learn to solve practical problems: walking a maze, playing Gobang, and this time searching for restaurants.

reference material

  1. Three methods and analysis of seeking TOPK: https://blog.csdn.net/coder_panyy/article/details/76359368

  2. BFPRT algorithm (TOP-K problem): https://ethsonliu.com/2018/03/bfprt.html

  3. [data structure] summary of nine basic sorting algorithms: https://blog.csdn.net/yuchenchenyi/article/details/51470500

  4. BFPRT algorithm (i.e. the median-of-medians algorithm): https://blog.csdn.net/a3192048/article/details/82055183

  5. Data Structures and Program Design in C++, by Robert L. Kruse and Alexander J. Ryba, translated by Qian Liping

supplement

After the teacher's explanation and what stronger classmates shared, it turns out that the best approach is to use a k-nearest neighbor algorithm or nearest neighbor search. I'm still a beginner and didn't know about them, so I'm just recording this here and hope to learn more later.


Posted on Sat, 27 Jun 2020 00:03:02 -0400 by 2DaysAway