Python is one of the fastest-growing programming languages in the world, and it has proven its practicality in data science work again and again. Its ecosystem of libraries makes it a sound choice for users everywhere, from beginners to advanced practitioners.
In this article, we introduce some Python libraries for data science. They are not as well known as pandas, scikit-learn, and matplotlib, but they are also very practical.
Extracting data, especially from the web, is one of the main tasks of a data scientist. Wget is a free utility for non-interactive file downloads from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can run in the background even when the user is not logged in. So the next time you need to download all the images from a website or page, wget can help.
$ pip install wget
import wget
url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
filename = wget.download(url)
100% [................................................] 3841532 / 3841532
filename
'razorback.mp3'
If you still struggle with dates and times in Python, you need Pendulum. It is a Python package that simplifies datetime manipulation, and it is a drop-in replacement for Python's native datetime class.
$ pip install pendulum
import pendulum
dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')
print(dt_vancouver.diff(dt_toronto).in_hours())
3
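For comparison, here is a rough sketch of the same calculation using only the standard library. To stay self-contained it assumes fixed January offsets (UTC-5 for Toronto, UTC-8 for Vancouver) instead of looking up named time zones, so unlike Pendulum it does not account for daylight saving time. The constant names are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for January (standard time), hard-coded for illustration.
TORONTO = timezone(timedelta(hours=-5), "EST")
VANCOUVER = timezone(timedelta(hours=-8), "PST")

dt_toronto = datetime(2012, 1, 1, tzinfo=TORONTO)
dt_vancouver = datetime(2012, 1, 1, tzinfo=VANCOUVER)

# Subtracting two aware datetimes compares them on the UTC timeline.
delta = dt_vancouver - dt_toronto
hours_apart = int(delta.total_seconds() // 3600)
print(hours_apart)
```

Midnight arrives three hours later in Vancouver than in Toronto, which is the gap the Pendulum call reports.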
Most classification algorithms work best when each class has roughly the same number of samples. Many real-world data sets are imbalanced, however, which can affect both the learning phase and the subsequent predictions of a machine learning algorithm. Fortunately, the imbalanced-learn library was created to address this problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. Next time you encounter an imbalanced data set, give it a try.
pip install -U imbalanced-learn
# or
conda install -c conda-forge imbalanced-learn
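One of the simplest rebalancing strategies is random over-sampling: duplicate randomly chosen minority-class samples until every class is as large as the biggest one. The following is a minimal pure-Python sketch of that idea, not imbalanced-learn's implementation. The function name `random_oversample` and the toy data are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original sample, then top up with random duplicates.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data (hypothetical): 6 samples of class "a", 2 of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [5.1]]
y = ["a"] * 6 + ["b"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y))      # class counts before resampling
print(Counter(y_res))  # class counts after resampling
```

In practice you would likely reach for the library's own samplers, which cover this strategy as well as more elaborate ones.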
Cleaning up text data in NLP tasks often requires replacing keywords in sentences or extracting keywords from them. Such operations can usually be done with regular expressions, but they become cumbersome once the number of terms to search for reaches the thousands.
Python's FlashText module, which is based on the FlashText algorithm, offers a suitable alternative for this situation. The best thing about FlashText is that its running time stays the same regardless of the number of search terms.
$ pip install flashtext
1) Extract keywords
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
keywords_found
['New York', 'Bay Area']
2) Replace keywords
keyword_processor.add_keyword('New Delhi', 'NCR region')
new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
new_sentence
'I love New York and NCR region.'
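The lookup cost can stay flat as the keyword list grows because the keywords are stored in a trie and the text is walked through it. Below is a simplified word-level sketch of that idea, assuming whitespace tokenization and case-insensitive matching. It is not FlashText's actual implementation, and `build_trie` and `extract` are names invented here.

```python
_END = object()  # sentinel key that stores the standardized name of a complete keyword


def build_trie(keywords):
    """Build a nested-dict trie keyed by lowercase words."""
    trie = {}
    for phrase, clean_name in keywords.items():
        node = trie
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[_END] = clean_name
    return trie


def extract(trie, sentence):
    """Return the longest keyword match found at each position of the sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    found, i = [], 0
    while i < len(words):
        node, j, match, match_end = trie, i, None, i
        # Walk the trie for as long as the upcoming words allow.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append(match)
            i = match_end  # skip past the matched phrase
        else:
            i += 1
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
keywords_in_text = extract(trie, "I love Big Apple and Bay Area.")
print(keywords_in_text)
```

Each step is a single dictionary lookup, so the work done per position depends on the length of the keywords rather than on how many of them are stored.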
The name sounds odd, but FuzzyWuzzy is a very useful library for string matching. It makes it easy to compute string similarity ratios, token ratios, and similar measures. It is also handy for matching records stored in different databases.
$ pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Simple Ratio
fuzz.ratio("this is a test", "this is a test!")
97
# Partial Ratio
fuzz.partial_ratio("this is a test", "this is a test!")
100
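To make the two scores less opaque, here is a rough standard-library approximation built on `difflib.SequenceMatcher`. It is a sketch of the idea only: FuzzyWuzzy's real implementation differs in its details (for instance, in how it picks candidate substrings for the partial ratio), so scores may not always agree. `simple_ratio` and `partial_ratio` are illustrative names.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two full strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


full_score = simple_ratio("this is a test", "this is a test!")
partial_score = partial_ratio("this is a test", "this is a test!")
print(full_score, partial_score)
```

The full comparison is penalized for the trailing exclamation mark, while the partial comparison finds a window of the longer string that matches the shorter one exactly.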
Time series analysis is one of the most common problems in machine learning. PyFlux is an open-source Python library built specifically for working with time series. It offers an excellent array of modern time series models, including ARIMA, GARCH, and VAR models. In short, PyFlux provides a probabilistic approach to time series modeling.
pip install pyflux
Communicating results is an essential part of data science, and being able to visualize them is a significant advantage. IPyvolume is a Python library for visualizing 3D volumes and glyphs (such as 3D scatter plots) in Jupyter notebooks with minimal configuration.
Using pip
$ pip install ipyvolume
Conda/Anaconda
$ conda install -c conda-forge ipyvolume
pip install dash==0.29.0  # The core dash backend
pip install dash-html-components==0.13.2  # HTML components
pip install dash-core-components==0.36.0  # Supercharged components
pip install dash-table==3.1.3  # Interactive DataTable component (new!)
The following Dash example shows a highly interactive graph with a drop-down menu. When the user selects a value from the drop-down, the application code dynamically loads data from Google Finance into a pandas DataFrame.
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with numerical computation libraries such as TensorFlow or Theano. The library is a collection of test problems, also called environments, that you can use to develop and test your reinforcement learning algorithms. These environments share a common interface, which lets you write general algorithms.
pip install gym
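To illustrate what a shared interface buys you, the sketch below defines a toy environment with `reset()` and `step()` methods modeled loosely on the pattern used by classic Gym, plus an episode loop that works with any object exposing those two methods. `CoinFlipEnv` and `run_episode` are hypothetical names that are not part of Gym, and the exact return values of `step()` vary between Gym versions.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a coin flip at each step."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.steps = 0
        return 0  # this toy environment has a single dummy observation

    def step(self, action):
        """Apply an action (0 or 1) and return (observation, reward, done, info)."""
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        return 0, reward, done, {}


def run_episode(env, policy):
    """Generic loop: works with any object exposing reset() and step()."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        observation, reward, done, _info = env.step(policy(observation))
        total_reward += reward
    return total_reward


agent_rng = random.Random(1)
total = run_episode(CoinFlipEnv(), lambda obs: agent_rng.randint(0, 1))
print(total)
```

Because `run_episode` only relies on `reset()` and `step()`, the same loop could drive a different environment without changes, which is the point of the common interface.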
The following example runs the CartPole-v0 environment for 1000 time steps, rendering the environment at each step.