absolute vs derived data
Absolute = raw data, counts or measurements
Derived = standardized; a count of data divided by something else → e.g., an average
5 types of derived data
Proportion = a proportion of the whole → geography students/total students
Percent = a proportion per 100 (or per 1,000) → geography students/total students * 100
Ratio = a proportion between two variables → geography students/non-geography students
Density = a proportion within land → geography students/sq.km
Rate = calculated for many things → number of cases/population * 100,000
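A quick sketch of these five calculations in Python; the student counts, land area, and case counts are made-up numbers, not real course data:

```python
geog_students = 30          # hypothetical count
total_students = 120
non_geog_students = total_students - geog_students
area_sq_km = 2.5            # hypothetical land area
cases = 40                  # hypothetical case count
population = 50_000

proportion = geog_students / total_students       # 0.25
percent = proportion * 100                        # 25.0
ratio = geog_students / non_geog_students         # ~0.33
density = geog_students / area_sq_km              # 12.0 per sq. km
rate = cases / population * 100_000               # 80 per 100,000

print(proportion, percent, ratio, density, rate)
```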
population vs sample
Population = the total set of elements that one can study
Ex: everyone in the class of GEOG 372 is a population, the prof can ask us anything
Sample = portion of the population you actually study
Ex: asking only one row in the classroom questions and assuming it applies to everyone
statistics vs descriptive statistics vs inferential statistics
Statistics = numerical characteristics that summarize data
Descriptive statistics = statistics that describe the population or sample you actually have
Inferential statistics = when you apply the sample values to the whole population
Ex: voting results in the newspaper report a margin of error along with the results, because the sample results are assumed to be reflective of the whole population
extrapolation
= describing a population based on the sample
unordered data vs ordered data
Unordered data = not sorted by the data value (e.g., the proportion)
Ordered data = sorted by the data value
Ordering lets you see the minimum, maximum, range (expressed either as a span or as a single number, max minus min), duplicates, and outliers
You can sometimes see a geographic pattern if you know where the places are -> it is easier if you actually make the map to visualize the data
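A small Python sketch of what ordering makes visible (the values are made up):

```python
values = [12, 7, 7, 15, 9, 42, 11]       # hypothetical data

ordered = sorted(values)                 # [7, 7, 9, 11, 12, 15, 42]
minimum, maximum = ordered[0], ordered[-1]
data_range = maximum - minimum           # range as a single number (max - min)
has_duplicates = len(set(ordered)) < len(ordered)

print(ordered, minimum, maximum, data_range, has_duplicates)
# 42 stands out as an outlier once the data are ordered
```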
measures of central tendency
= measures of where the data is centered
Mean = average value across all data -> affected by outliers
Median = middle number -> not affected by outliers
Mode = the most often occurring value -> can be useful for nominal data
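A minimal sketch of the three measures using Python's built-in statistics module (made-up values):

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 40]          # 40 is an outlier

print(statistics.mean(values))           # 9.0 -> pulled up by the outlier
print(statistics.median(values))         # 4   -> not affected by the outlier
print(statistics.mode(values))           # 3   -> most frequently occurring value
```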
data distribution
= how the data are spread out; pay attention to the range and the outliers. Ways to show data distribution:
Point graph/number line = looks at the range of data through a line and marking each data value within the line
Makes it more obvious that there are outliers
Drawback: obscures duplicates -> you can't see how many duplicates there are
Histogram = a bar graph; the x axis shows the range of values (in bins), the y axis shows how many values fall in each bin
Useful for looking at how the data are distributed
Useful to include these on a map if the value distribution is important to the map
Can tell you a lot about how the data are distributed
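A sketch of both displays (point graph and histogram), assuming matplotlib is available; the values are made up:

```python
import matplotlib.pyplot as plt

values = [3, 5, 5, 6, 7, 7, 7, 8, 9, 12, 25]

fig, (ax1, ax2) = plt.subplots(2, 1)

# point graph / number line: mark each value along a line (duplicates overlap)
ax1.plot(values, [0] * len(values), "o")
ax1.set_yticks([])

# histogram: x axis covers the range, y axis counts values in each bin
ax2.hist(values, bins=5)

plt.show()
```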
four types of data distribution
Normal = a bell curve, the largest occurrence is in the middle of the range
Common in nature and in many kinds of data
Uniform = the same or almost the same number of occurrences of each value in the data set (also known as an even distribution)
Less common to have a perfectly uniform distribution
Skewed = a lot of the values fall at one end of the range or the other
Not the same as an outlier distribution, where only one or two values sit on the other side
Outlier = a roughly normal distribution with one or two values far off on one side
two measures of dispersion
Range = shows how spread out the data is
Standard deviation = another measure of how spread out the values in a data set are; roughly, the average distance between each value and the mean
The larger the standard deviation, the larger the variation
The smaller the standard deviation, the smaller the variation
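A minimal sketch of both measures of dispersion (made-up values):

```python
import statistics

values = [4, 6, 7, 7, 9, 15]

data_range = max(values) - min(values)   # how spread out the data is
std_dev = statistics.pstdev(values)      # average distance from the mean (population SD)

print(data_range, round(std_dev, 2))     # larger SD -> more variation
```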
scatterplot graph
= plots two variables against each other, which is a way to show correlation between variables
A line of best fit will show the correlation
Correlation does not imply causation
The strength of a correlation can depend on the units (scale) at which it is measured; if the correlation holds across several different units, you know the correlation is strong
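A sketch of a scatterplot with a line of best fit and a correlation coefficient, assuming numpy and matplotlib are available; the two variables are made-up example data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

r = np.corrcoef(x, y)[0, 1]              # strength of the correlation
slope, intercept = np.polyfit(x, y, 1)   # line of best fit

plt.scatter(x, y)
plt.plot(x, slope * x + intercept)
plt.title(f"correlation r = {r:.2f}")
plt.show()
```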
data classification
= process of combining data into groups, or classes, with each class represented by a different symbol
Usually these are choropleth maps with areas shaded by different hues -> need classification to differentiate between data symbols
Usually use 4-6 different classes -> anything more than six is hard to distinguish between the classifications
There usually isn't one "right" classification to use
Depends on the audience
Depends on what data you are classifying
four methods of data classification
Equal Interval = each class covers an equal-sized interval of the data range
Proper equal intervals consider only the range of your data
Easy to compute and easy to understand -> but there can be an empty class (as in the example map) because the values are not evenly distributed
Doesn't show data distribution
Best used when data are evenly distributed, so that each class is represented
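A minimal sketch of equal-interval breaks, with made-up values chosen so that one class ends up empty:

```python
values = [2, 3, 5, 8, 9, 11, 30, 34, 35, 38]
num_classes = 4

width = (max(values) - min(values)) / num_classes              # 9.0
breaks = [min(values) + width * i for i in range(1, num_classes + 1)]
print(breaks)   # [11.0, 20.0, 29.0, 38.0] -> the 20-29 class is empty
```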
Quantiles = put an equal number of data points in each class -> the number of values in each class is equal
Ex: quartiles (4 classes), quintiles (5 classes), sextiles (6 classes)
For the number of classes, you can just pick how many you want
Sometimes when quantiles don't work out perfectly, you can manually move values so that at least identical values end up in the same class
Easy to compute and easy to understand
Good for ordinal data
Map often turns out looking nice
Doesn't consider data distribution: widely different values can get classed together, and close values can end up in different classes
Good for evenly distributed data
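A minimal sketch of quantile classification with four classes (made-up values):

```python
values = sorted([2, 3, 5, 8, 9, 11, 30, 34, 35, 38, 40, 52])
num_classes = 4

per_class = len(values) // num_classes      # 3 values per class
classes = [values[i * per_class:(i + 1) * per_class] for i in range(num_classes)]
print(classes)
# [[2, 3, 5], [8, 9, 11], [30, 34, 35], [38, 40, 52]]
# each class holds the same number of values, regardless of how far apart they are
```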
Mean-SD = class breaks are placed at the mean and at standard deviations above and below it. With four classes: everything below mean - SD is the first class, mean - SD to the mean is the second, the mean to mean + SD is the third, and above mean + SD is the fourth
Generally works very well when the mean is a useful dividing point in the data and creates an obvious dividing point
Does a good job of showing that there is one small outlier and one big outlier
Good for normally distributed data
Needs an understanding of statistics to compute, and it's important that the map user can interpret the map well (if they don't have a background in statistics, it might not make sense)
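A minimal sketch of mean-SD breaks for four classes, assuming breaks at mean - SD, the mean, and mean + SD (made-up values):

```python
import statistics

values = [2, 3, 5, 8, 9, 11, 30, 34, 35, 38]
mean = statistics.mean(values)      # 17.5
sd = statistics.pstdev(values)      # ~14.0

breaks = [mean - sd, mean, mean + sd]
print([round(b, 1) for b in breaks])
# class 1: below mean - SD, class 2: up to the mean,
# class 3: up to mean + SD, class 4: above mean + SD
```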
Natural breaks = use naturally occurring breakpoints/gaps to minimize the difference between values that fall in the same class and maximize the difference between values in different classes
Can do this manually, by looking at the data and looking for obvious breaking points -> different people will break groups at different points (subjective)
Another way to do this is with the Jenks optimal method/Jenks optimization algorithm = an algorithm that does what you would do subjectively, optimizing sameness within classes -> often the best choice
The default classification method in ArcGIS
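A sketch of the simple, manual version of natural breaks: sort the data and put class breaks at the largest gaps between consecutive values (the Jenks algorithm does this more rigorously by minimizing within-class variance). The values are made up:

```python
values = sorted([2, 3, 5, 8, 15, 16, 18, 30, 34, 35])
num_classes = 3

gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
largest = sorted(gaps, reverse=True)[:num_classes - 1]   # the biggest gaps
break_points = sorted(values[i] for _, i in largest)     # upper bound of each class
print(break_points)   # [8, 18] -> classes: 2-8, 15-18, 30-35
```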
how to pick data classification methods
Equal Interval doesn't represent the data well when there are outliers, and doesn't show the overall distribution
Quantiles don't take outliers into account
Mean-SD places too much importance on the mean and shouldn't be used for a general audience
Natural Breaks shows similarities and outliers, and overall data distribution