In databases, it is common for data concerning the same individuals to be stored across multiple tables. For instance, one university office might track data regarding the time students take to complete their degrees, while another might record their tuition fees and financial aid. Integrating these datasets can provide a clearer view of the overall student experience. When these data are distributed across two tables, where each table has one row for each student, we can blend the columns together while ensuring that each student's data remains in a single coherent row. This integration can be illustrated through a simplified example, and subsequently applied to larger datasets.
In our example, we might have a table called "cones" that contains flavours of ice cream, and another table that records ratings for those flavours. The rating table is defined as follows:
ratings = Table().with_columns(
'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
'Stars', make_array(2.5, 3.5, 4)
)
Here, the "cones" table includes a column titled "Flavor," while the "ratings" table includes a column called "Kind." The data within these columns can be used to associate the two tables effectively. The join
function creates a new table, augmenting each cone entry in the "cones" table with the corresponding star rating from the "ratings" table. For every cone in the "cones" table, the join
function matches a row from the "ratings" table where the "Kind" aligns with the cone's "Flavor."
This is accomplished through the following command:
rated = cones.join('Flavor', ratings, 'Kind')
After executing the join operation, each ice cream cone will have both its price and the rating of its flavour. The structure of this augmented table illustrates how to assess the value of each cone by calculating the price per star. Lower prices per star indicate better value as they signify that consumers are paying less for each point of the rating scale. The computation can be expressed as follows:
rated.with_column('$ per Star', rated.column('Price') / rated.column('Stars')).sort(3)
Notably, while strawberry cones might receive the lowest average rating among flavours, their lower price enables them to excel in the price-per-star metric.
It’s important to mention that the order in which two tables are listed while performing the join
operation can impact the appearance of the resulting table's columns but does not fundamentally change the data itself. For example, switching the order:
ratings.join('Kind', cones, 'Flavor')
This command demonstrates how the order can affect column arrangement, yet the underlying data relationship remains unchanged.
The join
function will only include information that is present in both tables. This can lead to certain entries being excluded. For instance, if we have a table of reviews evaluating various ice cream cones and find the average ratings for each flavour through the command:
average_review = reviews.group('Flavor', np.average)
To join this average rating table with the "cones" table, we again identify the columns necessary for matching:
average_review.join('Flavor', cones)
The impact of excluding entries that lack matching data can be significant. For example, if no reviews exist for strawberry cones, they will not appear in the joined table. This lack of data might be an obstacle or not depending on what kind of analysis is being conducted. The outcome may look like this:
Flavor | Price | Stars | Average |
---|---|---|---|
Chocolate | 6.55 | 3.5 | 5 |
Chocolate | 5.75 | 3.5 | 4 |
Ultimately, the absence of strawberry cones demonstrates that lack of reviews can lead to gaps in the joined dataset, highlighting the necessity of understanding the data's completeness for effective analysis.