pandas data type conversion

Data types in data processing

  • When processing data with pandas, we often encounter the problem of data type. When we get the data, we first need to make sure that we get the correct type of data. Generally, through the conversion of data type, this article introduces the data type inside pandas (data types are commonly used dtyps), and the data correspondence between pandas and numpy.
  • The main types are object, int64, float64, datetime64, bool, etc. category and timedelta are introduced separately in other articles.Of course, this article will also cover a simple introduction.

Problems with data types are usually discovered after a problem occurs, so once you have some experience, you will get the data, and then directly look at the data type, whether it is consistent with the data format you want to process, which can avoid some embarrassing problems from the beginning.So let's take a simple example and introduce a data type using jupyter notebook.

####Import two commonly used data processing packages, numpy and pandas, by convention
import numpy as np
import pandas as pd
# Reads the data from the csv file, the data table only has 5 rows, contains float, string, int three data python types, namely float64, object, int64 for the corresponding pandas
# There are six columns in the csv file, the first is the header and the rest is the data.
df = pd.read_csv("sales_data_types.csv")
print(df)
   Customer Number     Customer Name          2016            2017  \
0            10002  Quest Industries  $125,000.00     $162,500.00    
1           552278    Smith Plumbing  $920,000.00   $1,012,000.00    
2            23477   ACME Industrial   $50,000.00      $62,500.00    
3            24900        Brekke LTD  $350,000.00     $490,000.00    
4           651029         Harbor Co   $15,000.00      $12,750.00    

  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N  
df.dtypes
Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object
# If you want to get the sum of 2016 and 2017 data, you can try, but this is not the answer we need, because the data type in these two columns is object, and after doing so, you get a longer string.
# Of course, we can get more detailed information about the data frame through df.info().
df['2016']+df['2017']

0      $125,000.00 $162,500.00 
1    $920,000.00 $1,012,000.00 
2        $50,000.00 $62,500.00 
3      $350,000.00 $490,000.00 
4        $15,000.00 $12,750.00 
dtype: object
df.info()
# Customer Number column is float64, but should be int64
# The data for the 2016 2017 columns are object s, not float64 or int64 formats
# Percent and Jan Units are also objects, not numbers
# Month, Day, and Year should be converted to datetime64[ns] format
# Active column should be Boolean
# Without data cleaning, it is difficult to do the next step in data analysis. There are three more common methods in pandas for data format conversion.
# 1. astype() Forces the conversion of data types
# 2. Data transformation by creating custom functions
# 3. to_nueric() and to_datetime() provided by pandas
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null int64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: int64(4), object(6)
memory usage: 480.0+ bytes

First, the most commonly used astype()

For example, astype() can be used to convert the data in the first column to an integer int type

df['Customer Number'].astype("int")
#  This does not change the original data frame, it just returns a copy
0     10002
1    552278
2     23477
3     24900
4    651029
Name: Customer Number, dtype: int32
# To really change a data frame, you usually need to assign values, such as
df["Customer Number"] = df["Customer Number"].astype("int")
print(df)
print("--------"*10)
print(df.dtypes)
   Customer Number     Customer Name          2016            2017  \
0            10002  Quest Industries  $125,000.00     $162,500.00    
1           552278    Smith Plumbing  $920,000.00   $1,012,000.00    
2            23477   ACME Industrial   $50,000.00      $62,500.00    
3            24900        Brekke LTD  $350,000.00     $490,000.00    
4           651029         Harbor Co   $15,000.00      $12,750.00    

  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N  
--------------------------------------------------------------------------------
Customer Number     int32
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object
# By assigning values to transform data from the original data frame, you can take a fresh look at our newly generated data frame
print(df)
   Customer Number     Customer Name          2016            2017  \
0            10002  Quest Industries  $125,000.00     $162,500.00    
1           552278    Smith Plumbing  $920,000.00   $1,012,000.00    
2            23477   ACME Industrial   $50,000.00      $62,500.00    
3            24900        Brekke LTD  $350,000.00     $490,000.00    
4           651029         Harbor Co   $15,000.00      $12,750.00    

  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N  
#Then, columns like Percent Growth, Jan Units, with special symbols, cannot be converted directly by the astype ("flaot") method.
#This, in conjunction with the conversion of a string in python to a floating point number, requires that the original character only contain the number itself, not other special characters
 #We can try to convert the Active column to a Boolean value and see what happens. All five results are True, indicating that they didn't work.
#df["Active"].astype("bool")

df['2016'].astype('float')
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-19-47cc9d68cd65> in <module>()
----> 1 df['2016'].astype('float')


C:\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, raise_on_error, **kwargs)
   3052         # else, only a single dtype is given
   3053         new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054                                      raise_on_error=raise_on_error, **kwargs)
   3055         return self._constructor(new_data).__finalize__(self)
   3056 


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3187 
   3188     def astype(self, dtype, **kwargs):
-> 3189         return self.apply('astype', dtype=dtype, **kwargs)
   3190 
   3191     def convert(self, **kwargs):


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3054 
   3055             kwargs['mgr'] = self
-> 3056             applied = getattr(b, f)(**kwargs)
   3057             result_blocks = _extend_blocks(applied, result_blocks)
   3058 


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs)
    459                **kwargs):
    460         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461                             values=values, **kwargs)
    462 
    463     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
    502 
    503                 # _astype_nansafe works fine with 1-d only
--> 504                 values = _astype_nansafe(values.ravel(), dtype, copy=True)
    505                 values = values.reshape(self.shape)
    506 


C:\Anaconda3\lib\site-packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy)
    535 
    536     if copy:
--> 537         return arr.astype(dtype)
    538     return arr.view(dtype)
    539 


ValueError: could not convert string to float: '$15,000.00 '

The above issues illustrate some issues

  • If the data is pure data, it can be converted to numbers
  • Astype is basically two kinds of use, the number is converted to a simple string, the string of a simple number is converted to a number, and strings containing other non-numbers cannot be converted through astype.
  • You need to introduce other methods to transform, so you have the following custom function methods

Clean up data through custom functions

  • Currency can be converted using the following functions
def convert_currency(var):
    """
    convert the string number to a float
    _ Remove $
    - Remove commas,
    - Convert to floating point type
    """
    new_value = var.replace(",","").replace("$","")
    return float(new_value)
# Strings are converted to floating-point numbers by removing the $and commas with the replace function, allowing pandas to select a specific type, float or int, that pandas deems appropriate, in which case the data is converted to float64
# Convert all data in column 2016 through the apply function in pandas
df["2016"].apply(convert_currency)
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
# Of course, you can pass this simpler function one line at a time through the lambda function
df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
#You can also use lambda expressions to clean up PercentGrowth data
df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
0    0.30
1    0.10
2    0.25
3    0.04
4   -0.15
Name: Percent Growth, dtype: float64
# It can also be solved by a custom function, with the same result
# The last custom function is to use the np.where() function to convert the Active column to a Boolean value.
df["Active"] = np.where(df["Active"] == "Y", True, False)

df["Active"]
0     True
1     True
2     True
3     True
4    False
Name: Active, dtype: bool
# Here's a look at the data format
df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
df.dtypes
Customer Number      int32
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active                bool
dtype: object
# View the DataFrame again
# At this point only the formats in Jan Units need to be converted and the year, month and day merged, which can be handled using several functions that come with the pandas
print(df)
   Customer Number     Customer Name      2016       2017  Percent Growth  \
0            10002  Quest Industries  125000.0   162500.0            0.30   
1           552278    Smith Plumbing  920000.0  1012000.0            0.10   
2            23477   ACME Industrial   50000.0    62500.0            0.25   
3            24900        Brekke LTD  350000.0   490000.0            0.04   
4           651029         Harbor Co   15000.0    12750.0           -0.15   

  Jan Units  Month  Day  Year Active  
0       500      1   10  2015   True  
1       700      6   15  2014   True  
2       125      3   29  2016   True  
3        75     10   27  2015   True  
4    Closed      2    2  2014  False  

Processing with functions in pandas

# pd.to_numeric() in pandas handles data in Jan Units
pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)
0    500.0
1    700.0
2    125.0
3     75.0
4      0.0
Name: Jan Units, dtype: float64
# Finally, the year, month and day are merged using pd.to_datatime()
pd.to_datetime(df[['Month', 'Day', 'Year']])
0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
dtype: datetime64[ns]
# Do this without forgetting to reassign, otherwise the original data will not change
df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce')
df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active Start_date
0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True 2015-01-10
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True 2014-06-15
2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True 2016-03-29
3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True 2015-10-27
4 651029 Harbor Co 15000.0 12750.0 -0.15 NaN 2 2 2014 False 2014-02-02
df.dtypes
Customer Number             int32
Customer Name              object
2016                      float64
2017                      float64
Percent Growth            float64
Jan Units                 float64
Month                       int64
Day                         int64
Year                        int64
Active                       bool
Start_date         datetime64[ns]
dtype: object
# Integrate these transformations
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    - Remove %
    - Divide by 100 to make decimal
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

df_2 = pd.read_csv("sales_data_types.csv",dtype={"Customer_Number":"int"},converters={
    "2016":convert_currency,
    "2017":convert_currency,
    "Percent Growth":convert_percent,
    "Jan Units":lambda x:pd.to_numeric(x,errors="coerce"),
    "Active":lambda x: np.where(x=="Y",True,False)
})
df_2.dtypes
Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units          float64
Month                int64
Day                  int64
Year                 int64
Active              bool
dtype: object
df_2
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active
0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True
2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True
3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True
4 651029 Harbor Co 15000.0 12750.0 -0.15 NaN 2 2 2014 False

So far, there are two types of data in pandas: timedelta and category. Later, we will focus on category, which is designed with reference to category in R, added after pandas 0.16, and then tidy up the common methods of pandas as needed.

Tags: Lambda Python jupyter REST

Posted on Sun, 19 Apr 2020 02:10:28 -0400 by mouse02