In data science, we frequently create new columns in tables by applying functions to existing columns or arrays. Typically, these functions take arrays as arguments, but sometimes we need to use a function that accepts an individual value instead of an array. For instance, consider the function cut_off_at_100, designed to return the smaller of its argument and 100:
    def cut_off_at_100(x):
        """The smaller of x and 100"""
        return min(x, 100)
This function behaves as follows:
    cut_off_at_100(17) returns 17
    cut_off_at_100(117) returns 100
    cut_off_at_100(100) returns 100
This function can be particularly useful when dealing with age data, where values exceeding 100 should be capped.
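Because cut_off_at_100 relies on Python's built-in min, it expects a single number. The short sketch below uses plain Python, with an ordinary list standing in for a column, to show what happens when a whole sequence is passed instead. The names single_result, ages_list, and failed are illustrative only.

```python
def cut_off_at_100(x):
    """The smaller of x and 100"""
    return min(x, 100)

# A single value works as expected.
single_result = cut_off_at_100(117)

# A whole list does not: min cannot compare a list with the number 100,
# so Python raises a TypeError.
ages_list = [17, 117, 52]
try:
    cut_off_at_100(ages_list)
    failed = False
except TypeError:
    failed = True

print(single_result, failed)
```

This is why a function written for individual values needs some mechanism that calls it once per element.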
The apply Method
To apply the cut_off_at_100 function to every entry in an age column, we use the apply method of the Table object. This method calls a function on each element of a specified column and returns a new array of results. We pass the function by name, without parentheses and without an argument: we are handing apply the function to use rather than calling it ourselves, much like showing a chef a recipe instead of baking the cake ourselves.
To illustrate this, let's define a table (named ages) with various people and their respective ages:
    from datascience import *
    ages = Table().with_columns(
        'Person', make_array('A', 'B', 'C', 'D', 'E', 'F'),
        'Age', make_array(17, 117, 52, 100, 6, 101)
    )
After creating the ages table, we can use the apply method to calculate the cut-off ages:
    cut_off_ages = ages.apply(cut_off_at_100, 'Age')
The result is an array in which the age values are capped at 100. For example:
Entry A remains 17
Entry B is capped at 100
Entry C remains 52
For the six ages in the table, the full array is 17, 100, 52, 100, 6, 100. Capping the values this way keeps the age data consistent for later analysis.
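To make the idea behind apply concrete, here is a minimal plain-Python sketch of "call a function once per element and collect the results". It is not how the datascience library implements apply; apply_to_each is a hypothetical helper, and a plain list stands in for the Age column.

```python
def cut_off_at_100(x):
    """The smaller of x and 100"""
    return min(x, 100)

def apply_to_each(func, values):
    """Hypothetical helper: call func once per value and collect the results."""
    return [func(value) for value in values]

# The same ages as in the ages table, held in a plain list.
age_values = [17, 117, 52, 100, 6, 101]

# The function is passed by name, without parentheses.
capped = apply_to_each(cut_off_at_100, age_values)
print(capped)
```

Note that cut_off_at_100 appears without parentheses in the call, just as it does in ages.apply(cut_off_at_100, 'Age').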
In Python, functions themselves are treated as values. Thus, you can refer to a function by its name without parentheses to indicate you are referencing it rather than invoking it. This allows you to assign new names to functions as needed. For instance:
    cut_off = cut_off_at_100
This means that cut_off is now an alias for the cut_off_at_100 function, and using either name will produce the same results.
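A short check, sketched in plain Python, confirms that the alias and the original name refer to the same function object rather than to a copy. The variable names same_object, result_via_alias, and result_via_original are illustrative.

```python
def cut_off_at_100(x):
    """The smaller of x and 100"""
    return min(x, 100)

# No parentheses: this refers to the function itself instead of calling it.
cut_off = cut_off_at_100

# Both names point at one and the same function object.
same_object = cut_off is cut_off_at_100

# Calling through either name gives the same result.
result_via_alias = cut_off(117)
result_via_original = cut_off_at_100(117)

print(same_object, result_via_alias, result_via_original)
```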
Another application of the apply method is making predictions from a dataset. For instance, consider a dataset containing the heights of children and their parents. Using the average of the two parents' heights, we can predict the height of a child.
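    # Average of each child's parents' heights, computed elementwise on the two columns
    parent_averages = (family_heights.column('father') + family_heights.column('mother')) / 2
    # Assumes heights is a table with a 'Parent Average' column and that predict_child is defined elsewhere
    height_predictions = heights.with_columns('Prediction', heights.apply(predict_child, 'Parent Average'))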
Comparing these predictions with the children's actual heights, each viewed against the parents' average height, shows how well the parent average predicts a child's height.
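The text does not show how predict_child is defined. One possible definition, given here purely as an assumption, predicts a child's height as the mean height of the children whose parent average lies close to the given value. The sketch below uses plain Python lists and made-up numbers rather than the family_heights data. The window parameter and all values are hypothetical.

```python
# Made-up illustrative values, not real measurements.
parent_averages = [66.0, 66.4, 67.0, 69.0, 69.3]
child_heights = [65.0, 67.0, 66.0, 70.0, 71.0]

def predict_child(parent_average, window=0.5):
    """Hypothetical predictor: mean height of the children whose parent
    average is within `window` of parent_average."""
    nearby = [
        child
        for avg, child in zip(parent_averages, child_heights)
        if abs(avg - parent_average) <= window
    ]
    # Non-empty whenever parent_average is itself one of the listed values.
    return sum(nearby) / len(nearby)

# One prediction per entry, in the same spirit as apply on a column.
predictions = [predict_child(avg) for avg in parent_averages]
print(predictions)
```

Because predict_child takes a single parent average and returns a single number, it is exactly the kind of function that apply is meant to call once per row.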
Overall, combining functions with methods such as apply allows for efficient data transformations and predictions in data science. As the age capping and height prediction examples show, a function written for a single value can be applied to every entry of a column in one step.