pandas data type conversion

Data types in data processing When processing data with pandas, we often encounter the problem of data type. When we ge...

Data types in data processing

  • When processing data with pandas, we often encounter the problem of data type. When we get the data, we first need to make sure that we get the correct type of data. Generally, through the conversion of data type, this article introduces the data type inside pandas (data types are commonly used dtyps), and the data correspondence between pandas and numpy.
  • The main types are object, int64, float64, datetime64, bool, etc. category and timedelta are introduced separately in other articles.Of course, this article will also cover a simple introduction.
Problems with data types are usually discovered after a problem occurs, so once you have some experience, you will get the data, and then directly look at the data type, whether it is consistent with the data format you want to process, which can avoid some embarrassing problems from the beginning.So let's take a simple example and introduce a data type using jupyter notebook.
####Import two commonly used data processing packages, numpy and pandas, by convention import numpy as np import pandas as pd # Reads the data from the csv file, the data table only has 5 rows, contains float, string, int three data python types, namely float64, object, int64 for the corresponding pandas # There are six columns in the csv file, the first is the header and the rest is the data. df = pd.read_csv("sales_data_types.csv") print(df)
Customer Number Customer Name 2016 2017 \ 0 10002 Quest Industries $125,000.00 $162,500.00 1 552278 Smith Plumbing $920,000.00 $1,012,000.00 2 23477 ACME Industrial $50,000.00 $62,500.00 3 24900 Brekke LTD $350,000.00 $490,000.00 4 651029 Harbor Co $15,000.00 $12,750.00 Percent Growth Jan Units Month Day Year Active 0 30.00% 500 1 10 2015 Y 1 10.00% 700 6 15 2014 Y 2 25.00% 125 3 29 2016 Y 3 4.00% 75 10 27 2015 Y 4 -15.00% Closed 2 2 2014 N
df.dtypes
Customer Number int64 Customer Name object 2016 object 2017 object Percent Growth object Jan Units object Month int64 Day int64 Year int64 Active object dtype: object
# If you want to get the sum of 2016 and 2017 data, you can try, but this is not the answer we need, because the data type in these two columns is object, and after doing so, you get a longer string. # Of course, we can get more detailed information about the data frame through df.info(). df['2016']+df['2017']
0 $125,000.00 $162,500.00 1 $920,000.00 $1,012,000.00 2 $50,000.00 $62,500.00 3 $350,000.00 $490,000.00 4 $15,000.00 $12,750.00 dtype: object
df.info() # Customer Number column is float64, but should be int64 # The data for the 2016 2017 columns are object s, not float64 or int64 formats # Percent and Jan Units are also objects, not numbers # Month, Day, and Year should be converted to datetime64[ns] format # Active column should be Boolean # Without data cleaning, it is difficult to do the next step in data analysis. There are three more common methods in pandas for data format conversion. # 1. astype() Forces the conversion of data types # 2. Data transformation by creating custom functions # 3. to_nueric() and to_datetime() provided by pandas
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 10 columns): Customer Number 5 non-null int64 Customer Name 5 non-null object 2016 5 non-null object 2017 5 non-null object Percent Growth 5 non-null object Jan Units 5 non-null object Month 5 non-null int64 Day 5 non-null int64 Year 5 non-null int64 Active 5 non-null object dtypes: int64(4), object(6) memory usage: 480.0+ bytes

First, the most commonly used astype()

For example, astype() can be used to convert the data in the first column to an integer int type
df['Customer Number'].astype("int") # This does not change the original data frame, it just returns a copy
0 10002 1 552278 2 23477 3 24900 4 651029 Name: Customer Number, dtype: int32
# To really change a data frame, you usually need to assign values, such as df["Customer Number"] = df["Customer Number"].astype("int") print(df) print("--------"*10) print(df.dtypes)
Customer Number Customer Name 2016 2017 \ 0 10002 Quest Industries $125,000.00 $162,500.00 1 552278 Smith Plumbing $920,000.00 $1,012,000.00 2 23477 ACME Industrial $50,000.00 $62,500.00 3 24900 Brekke LTD $350,000.00 $490,000.00 4 651029 Harbor Co $15,000.00 $12,750.00 Percent Growth Jan Units Month Day Year Active 0 30.00% 500 1 10 2015 Y 1 10.00% 700 6 15 2014 Y 2 25.00% 125 3 29 2016 Y 3 4.00% 75 10 27 2015 Y 4 -15.00% Closed 2 2 2014 N -------------------------------------------------------------------------------- Customer Number int32 Customer Name object 2016 object 2017 object Percent Growth object Jan Units object Month int64 Day int64 Year int64 Active object dtype: object
# By assigning values to transform data from the original data frame, you can take a fresh look at our newly generated data frame print(df)
Customer Number Customer Name 2016 2017 \ 0 10002 Quest Industries $125,000.00 $162,500.00 1 552278 Smith Plumbing $920,000.00 $1,012,000.00 2 23477 ACME Industrial $50,000.00 $62,500.00 3 24900 Brekke LTD $350,000.00 $490,000.00 4 651029 Harbor Co $15,000.00 $12,750.00 Percent Growth Jan Units Month Day Year Active 0 30.00% 500 1 10 2015 Y 1 10.00% 700 6 15 2014 Y 2 25.00% 125 3 29 2016 Y 3 4.00% 75 10 27 2015 Y 4 -15.00% Closed 2 2 2014 N
#Then, columns like Percent Growth, Jan Units, with special symbols, cannot be converted directly by the astype ("flaot") method. #This, in conjunction with the conversion of a string in python to a floating point number, requires that the original character only contain the number itself, not other special characters #We can try to convert the Active column to a Boolean value and see what happens. All five results are True, indicating that they didn't work.
#df["Active"].astype("bool")
df['2016'].astype('float')
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-19-47cc9d68cd65> in <module>() ----> 1 df['2016'].astype('float') C:\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, raise_on_error, **kwargs) 3052 # else, only a single dtype is given 3053 new_data = self._data.astype(dtype=dtype, copy=copy, -> 3054 raise_on_error=raise_on_error, **kwargs) 3055 return self._constructor(new_data).__finalize__(self) 3056 C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs) 3187 3188 def astype(self, dtype, **kwargs): -> 3189 return self.apply('astype', dtype=dtype, **kwargs) 3190 3191 def convert(self, **kwargs): C:\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs) 3054 3055 kwargs['mgr'] = self -> 3056 applied = getattr(b, f)(**kwargs) 3057 result_blocks = _extend_blocks(applied, result_blocks) 3058 C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs) 459 **kwargs): 460 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error, --> 461 values=values, **kwargs) 462 463 def _astype(self, dtype, copy=False, raise_on_error=True, values=None, C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs) 502 503 # _astype_nansafe works fine with 1-d only --> 504 values = _astype_nansafe(values.ravel(), dtype, copy=True) 505 values = values.reshape(self.shape) 506 C:\Anaconda3\lib\site-packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy) 535 536 if copy: --> 537 return arr.astype(dtype) 538 return arr.view(dtype) 539 ValueError: could not convert string to float: '$15,000.00 '
The above issues illustrate some issues
  • If the data is pure data, it can be converted to numbers
  • Astype is basically two kinds of use, the number is converted to a simple string, the string of a simple number is converted to a number, and strings containing other non-numbers cannot be converted through astype.
  • You need to introduce other methods to transform, so you have the following custom function methods

Clean up data through custom functions

  • Currency can be converted using the following functions
def convert_currency(var): """ convert the string number to a float _ Remove $ - Remove commas, - Convert to floating point type """ new_value = var.replace(",","").replace("$","") return float(new_value)
# Strings are converted to floating-point numbers by removing the $and commas with the replace function, allowing pandas to select a specific type, float or int, that pandas deems appropriate, in which case the data is converted to float64 # Convert all data in column 2016 through the apply function in pandas df["2016"].apply(convert_currency)
0 125000.0 1 920000.0 2 50000.0 3 350000.0 4 15000.0 Name: 2016, dtype: float64
# Of course, you can pass this simpler function one line at a time through the lambda function df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
0 125000.0 1 920000.0 2 50000.0 3 350000.0 4 15000.0 Name: 2016, dtype: float64
#You can also use lambda expressions to clean up PercentGrowth data df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
0 0.30 1 0.10 2 0.25 3 0.04 4 -0.15 Name: Percent Growth, dtype: float64
# It can also be solved by a custom function, with the same result # The last custom function is to use the np.where() function to convert the Active column to a Boolean value. df["Active"] = np.where(df["Active"] == "Y", True, False) df["Active"]
0 True 1 True 2 True 3 True 4 False Name: Active, dtype: bool
# Here's a look at the data format df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64") df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64") df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100 df.dtypes
Customer Number int32 Customer Name object 2016 float64 2017 float64 Percent Growth float64 Jan Units object Month int64 Day int64 Year int64 Active bool dtype: object
# View the DataFrame again # At this point only the formats in Jan Units need to be converted and the year, month and day merged, which can be handled using several functions that come with the pandas print(df)
Customer Number Customer Name 2016 2017 Percent Growth \ 0 10002 Quest Industries 125000.0 162500.0 0.30 1 552278 Smith Plumbing 920000.0 1012000.0 0.10 2 23477 ACME Industrial 50000.0 62500.0 0.25 3 24900 Brekke LTD 350000.0 490000.0 0.04 4 651029 Harbor Co 15000.0 12750.0 -0.15 Jan Units Month Day Year Active 0 500 1 10 2015 True 1 700 6 15 2014 True 2 125 3 29 2016 True 3 75 10 27 2015 True 4 Closed 2 2 2014 False

Processing with functions in pandas

# pd.to_numeric() in pandas handles data in Jan Units pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)
0 500.0 1 700.0 2 125.0 3 75.0 4 0.0 Name: Jan Units, dtype: float64
# Finally, the year, month and day are merged using pd.to_datatime() pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2015-01-10 1 2014-06-15 2 2016-03-29 3 2015-10-27 4 2014-02-02 dtype: datetime64[ns]
# Do this without forgetting to reassign, otherwise the original data will not change df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce') df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active Start_date 0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True 2015-01-10 1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True 2014-06-15 2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True 2016-03-29 3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True 2015-10-27 4 651029 Harbor Co 15000.0 12750.0 -0.15 NaN 2 2 2014 False 2014-02-02
df.dtypes
Customer Number int32 Customer Name object 2016 float64 2017 float64 Percent Growth float64 Jan Units float64 Month int64 Day int64 Year int64 Active bool Start_date datetime64[ns] dtype: object
# Integrate these transformations def convert_percent(val): """ Convert the percentage string to an actual floating point percent - Remove % - Divide by 100 to make decimal """ new_val = val.replace('%', '') return float(new_val) / 100 df_2 = pd.read_csv("sales_data_types.csv",dtype={"Customer_Number":"int"},converters={ "2016":convert_currency, "2017":convert_currency, "Percent Growth":convert_percent, "Jan Units":lambda x:pd.to_numeric(x,errors="coerce"), "Active":lambda x: np.where(x=="Y",True,False) })
df_2.dtypes
Customer Number int64 Customer Name object 2016 float64 2017 float64 Percent Growth float64 Jan Units float64 Month int64 Day int64 Year int64 Active bool dtype: object
df_2
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active 0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True 1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True 2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True 3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True 4 651029 Harbor Co 15000.0 12750.0 -0.15 NaN 2 2 2014 False

So far, there are two types of data in pandas: timedelta and category. Later, we will focus on category, which is designed with reference to category in R, added after pandas 0.16, and then tidy up the common methods of pandas as needed.

19 April 2020, 02:10 | Views: 2122

Add new comment

For adding a comment, please log in
or create account

0 comments