Cool but Very Practical Python Libraries

Python is a great programming language and one of the fastest-growing programming languages in the world. It has proved its usefulness in data science roles again and again. The Python ecosystem and its libraries make it a sound choice for users around the world, beginners and advanced users alike.


In this article, we introduce some Python libraries for data science. They are not as well known as pandas, scikit-learn and matplotlib, but they are very practical too.





Extracting data, especially from the web, is one of a data scientist's main tasks. Wget is a free utility for non-interactive file downloads from the web. It supports the HTTP, HTTPS and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can run in the background even if the user is not logged in. So the next time you need to download all the images from a website or page, wget can help you. The Python package of the same name offers a simple download function in the same spirit.




$ pip install wget




import wget
url = ''  # URL of the file to download (left blank in the source)

filename = wget.download(url)
# 100% [................................................] 3841532 / 3841532






If you still struggle with dates and times in Python, Pendulum is for you. It is a Python package that simplifies datetime manipulation and serves as a drop-in replacement for Python's native datetime class.




$ pip install pendulum




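import pendulum

# The same wall-clock time in two different time zones
dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')

# Difference between the two moments, in hours
print(dt_vancouver.diff(dt_toronto).in_hours())  # 3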







Most classification algorithms work best when the number of samples in each class is roughly the same. Real-world data sets, however, are often imbalanced, which can affect the learning phase and the subsequent predictions of a machine learning algorithm. Fortunately, the imbalanced-learn library was created to address this problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. The next time you encounter an imbalanced data set, give it a try.




pip install -U imbalanced-learn

# or

conda install -c conda-forge imbalanced-learn
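
To make the idea concrete, here is a minimal sketch of random over-sampling, one of the simplest re-sampling strategies for imbalanced data. It uses only the standard library and is not imbalanced-learn's own code (the library provides this strategy through its RandomOverSampler class). The function name random_oversample and the toy data are hypothetical, chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(group) for group in by_class.values())

    new_samples, new_labels = [], []
    for label, group in by_class.items():
        # Draw extra samples with replacement to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            new_samples.append(sample)
            new_labels.append(label)
    return new_samples, new_labels


# Hypothetical toy data: six samples of class 0, two of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [5.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y))      # class counts before re-sampling: 6 vs 2
print(Counter(y_res))  # class counts after re-sampling: 6 vs 6
```

Over-sampling only repeats existing minority samples, so it balances the class counts without adding new information.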





Cleaning up text data in NLP tasks usually requires replacing keywords or extracting keywords from sentences. In general, such operations can be done with regular expressions, but if the number of words to search reaches thousands, these operations will become very cumbersome.


The FlashText module for Python is based on the FlashText algorithm and provides a suitable alternative for this situation. The best thing about FlashText is that its running time stays the same regardless of the number of search terms.
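
Why a keyword search can stay fast as the keyword list grows is easiest to see with a trie: the keywords are stored in a nested dictionary, and the text is scanned once while following the trie. The sketch below is a simplified, word-level illustration of that idea, not the FlashText library's implementation; the names build_trie and extract are hypothetical.

```python
import re

_END = object()  # sentinel key that stores the clean name of a complete keyword


def build_trie(keywords):
    """Store each keyword phrase word by word in a nested dict."""
    trie = {}
    for phrase, clean_name in keywords.items():
        node = trie
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[_END] = clean_name
    return trie


def extract(text, trie):
    """Scan the text once and return the clean names of the longest matches."""
    words = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(words):
        node, j = trie, i
        match, match_end = None, i
        # Follow the trie as far as the upcoming words allow.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append(match)
            i = match_end  # continue after the matched phrase
        else:
            i += 1
    return found


keywords = {"Big Apple": "New York", "Bay Area": "Bay Area"}
trie = build_trie(keywords)
print(extract("I love Big Apple and Bay Area.", trie))
# ['New York', 'Bay Area']
```

Each step is a dictionary lookup on the current word, so adding more keywords makes the trie larger but does not add passes over the text.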




$ pip install flashtext




1) Extract keywords


from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

# keyword_processor.add_keyword(<unclean name>, <standardised name>)

keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
print(keywords_found)
# ['New York', 'Bay Area']


2) Replace keywords


keyword_processor.add_keyword('New Delhi', 'NCR region')

new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
print(new_sentence)
# 'I love New York and NCR region.'





The name sounds strange, but FuzzyWuzzy is a very useful library for string matching. It makes it easy to compute string similarity ratios and related measures. It is also handy for matching records that are stored in different databases.




$ pip install fuzzywuzzy




from fuzzywuzzy import fuzz
from fuzzywuzzy import process

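# Simple ratio: compares the two strings as a whole
fuzz.ratio("this is a test", "this is a test!")  # 97

# Partial ratio: compares the shorter string with the best-matching part of the longer one
fuzz.partial_ratio("this is a test", "this is a test!")  # 100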
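
To see roughly what these scores measure, the sketch below approximates both ratios with the standard library's difflib.SequenceMatcher. It is a simplified illustration under that assumption, not FuzzyWuzzy's actual implementation, and the helper names simple_ratio and partial_ratio_sketch are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two whole strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    """Best score of the shorter string against any equally long
    slice of the longer string."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


print(simple_ratio("this is a test", "this is a test!"))          # 97
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```

The simple ratio is lowered by the extra exclamation mark, while the partial ratio finds the shorter string intact inside the longer one and returns a perfect score.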





Time series analysis is one of the most common problems in machine learning. PyFlux is an open-source Python library built specifically for time series problems. It offers a good range of modern time series models, including ARIMA, GARCH and VAR models. In short, PyFlux provides a probabilistic approach to time series modeling.




pip install pyflux





An important part of data science is communicating results, and visualizing them well can give you a major advantage. ipyvolume is a Python library for visualizing 3D volumes and glyphs (such as 3D scatter plots) in Jupyter notebooks, requiring only a small amount of configuration.




Using pip
$ pip install ipyvolume

Using conda
$ conda install -c conda-forge ipyvolume











Dash is a productive Python framework for building web applications. It is built on top of Flask, Plotly.js and React.js. It ties UI elements such as drop-down menus and graphs to your Python analysis code, with no JavaScript required. Dash is ideal for building data visualization applications that are rendered in a web browser.




pip install dash==0.29.0  # The core dash backend
pip install dash-html-components==0.13.2  # HTML components
pip install dash-core-components==0.36.0  # Supercharged components
pip install dash-table==3.1.3  # Interactive DataTable component (new!)




A typical Dash example is a highly interactive graph with a drop-down menu. When the user selects a value in the drop-down, the application code dynamically loads data from Google Finance into a pandas DataFrame.








Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. The Gym library is a collection of test problems, also called environments, that you can use to develop and test your reinforcement learning algorithms. These environments share a common interface, which allows you to write general algorithms.




pip install gym
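
The shared interface is what makes general algorithms possible: an agent only needs to call reset() and step(action), whatever the environment is. The sketch below imitates that convention with a toy environment written in plain Python. CoinFlipEnv and run_random_agent are hypothetical names and not part of Gym; the sketch assumes the classic Gym convention in which step() returns an (observation, reward, done, info) tuple, and its actions attribute is a simplification of Gym's action space.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a coin flip.

    An episode lasts a fixed number of steps; each correct guess earns reward 1.
    """

    actions = (0, 1)

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.steps_taken = 0
        return 0

    def step(self, action):
        """Apply an action; return (observation, reward, done, info)."""
        coin = self.rng.choice(self.actions)
        reward = 1.0 if action == coin else 0.0
        self.steps_taken += 1
        done = self.steps_taken >= self.episode_length
        return coin, reward, done, {}


def run_random_agent(env, seed=1):
    """Generic loop: works with any environment exposing reset() and step()."""
    rng = random.Random(seed)
    env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = rng.choice(env.actions)  # take a random action
        _, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward


print(run_random_agent(CoinFlipEnv()))
```

Because run_random_agent relies only on reset() and step(), the same loop could drive any other environment that follows the same convention.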




A classic first example with Gym runs 1000 steps in the CartPole-v0 environment, rendering the environment at each step.


Posted on Fri, 08 May 2020 04:24:12 -0400 by shinagawa