Foundations

The goals of a forensic anthropologist are to locate and recover buried or surface remains, estimate a biological profile, analyze any skeletal trauma, and make observations about taphonomy. These goals have applications in the justice system, in the investigation of human rights violations, and in disaster victim identification following natural and mass disasters. Each of these applications has different sample compositions, methods, priorities, and criticisms.

Forensic anthropology has been critiqued for a lack of grounding in evolutionary theory, with a focus instead on method development.

The rise and fall of Daubert is a good example of this. The Daubert guidelines established rules for evidence admissibility and the standardization of practice, which led forensic anthropologists to attempt to standardize their methods to make them admissible in court. The fall of Daubert, however, comes from the recognition that instead of creating methods with good statistical backing simply to satisfy the Daubert criteria, forensic anthropologists should be minimizing error, ensuring confidence in analytical results, and grounding their methods in good science, because we are scientists who should be practicing good science anyway.

This raises the question of what good science is, and the necessity of defining terms such as reliability, error, and validity. Good science is reliable, valid, and minimizes error.

Reliability refers to getting the same score multiple times, meaning that the method is repeatable. This is like shooting arrows at a target and having all of your arrows clump together in one spot.

Validity is the probability of being correct more often than chance: valid methods give correct conclusions more often than chance would, and known error rates provide measures of validity. There are two types of validity: internal and external.

Internal validity is whether the study design, conduct, and analysis will answer a research question without bias. You internally validate your own study by collecting data, developing a method, and testing that method on the sample the data was collected from.

External validity is the ability of a method to be validated outside of the tested sample. Your results should be generalizable, meaning they can be extended to make predictions about the larger population.

One way to externally validate your method is cross-validation: building your model using only a percentage of the available sample and testing it on the remaining sample, which the model is blind to.

Error is, very simply, being incorrect. There are multiple sources of error, including practitioner error, instrument error, statistical error, and method error.

Practitioner error is human error: the result of a mistake, whether random or systematic, of negligence, or of incompetence, caused by the person or people involved in the study. Instrument error is a difference between the value indicated on the instrument and the actual value; an instrument that is improperly calibrated or simply not working right causes this kind of error.

Statistical error is the deviation between actual and predicted values, and it is estimated by the standard error, a measure of uncertainty.

Method error is due to the inherent limitations of the data, sample, or trait being observed. This could be due to sample size or composition, or to the fact that the trait is not a good fit for the method being developed.

There is also the matter of the data or variables being used. Variables can be split into two categories: numerical and categorical.

Numerical data will always be more precise than categorical data. Numerical data is split into discrete data and continuous data. 

Discrete data consists of whole numbers, such as counts, while continuous data can take any value within a range, such as a long bone measurement.

Categorical data is split into ordinal and nominal data types. 

Ordinal data refers to data that can be categorized and ordered, such as the scale for sexual dimorphism. 

Nominal data does not have any intrinsic order, such as nasal aperture shape. With these categorical variables, the more scales or categories there are, the more unreliable the data becomes. If there are 7 stages of expression instead of 3, the differences between the stages in the 7-stage system will be much smaller than the differences between the stages in the 3-stage system. The 7-stage system will therefore have higher interobserver error, as people are more likely to agree when there are fewer options.
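
To make this concrete, here is a minimal simulation sketch, assuming scikit-learn and entirely hypothetical, randomly generated trait expressions: two simulated observers score the same individuals on a 3-stage and a 7-stage scale, and Cohen's kappa (a standard chance-corrected agreement statistic) typically comes out lower for the 7-stage scale.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n = 200

# Hypothetical "true" trait expressions on a continuous 0-1 scale.
true_expression = rng.uniform(0, 1, n)

def score(expression, stages, observer_noise):
    """Bin a continuous expression into ordinal stages, with observer noise."""
    noisy = np.clip(expression + rng.normal(0, observer_noise, len(expression)), 0, 1)
    return np.minimum((noisy * stages).astype(int), stages - 1)

for stages in (3, 7):
    obs1 = score(true_expression, stages, observer_noise=0.08)
    obs2 = score(true_expression, stages, observer_noise=0.08)
    print(f"{stages}-stage scale: Cohen's kappa = "
          f"{cohen_kappa_score(obs1, obs2):.2f}")
```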

We also have different ways to deal with different types of data. If you have labeled data, you would use a supervised machine learning method. 

Supervised learning is a model that learns from training data sets. The algorithm knows the inputs and outputs from the sample and uses this information to try to generalize to new examples it has never seen before. This type of model is used in classification and regression.

Classification methods are used when the output is a discrete class label, such as male or female. 
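
As a toy illustration (not any published method), the sketch below uses scikit-learn's LogisticRegression with a handful of hypothetical long bone measurements to output a discrete class label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [femoral length (mm), femoral head diameter (mm)]
# with known sex labels; numbers are illustrative, not real reference data.
X_train = np.array([[430, 45], [455, 48], [470, 49],
                    [395, 41], [405, 42], [385, 40]])
y_train = np.array(["male", "male", "male", "female", "female", "female"])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict a discrete class label for a new, unlabeled case.
print(clf.predict([[440, 46]]))
```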

Regression methods are used when the output is a continuous value, such as an age.
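
A parallel sketch for regression, again with made-up numbers, predicts a continuous age from a hypothetical ordinal skeletal indicator score using LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: ordinal skeletal age-indicator scores
# paired with known ages in years; values are illustrative only.
X_train = np.array([[1], [2], [3], [4], [5], [6]])
y_train = np.array([19, 24, 31, 38, 47, 58])

reg = LinearRegression().fit(X_train, y_train)

# Predict a continuous value (age in years) for a new case.
print(reg.predict([[3.5]]))
```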

For unlabeled data, you would use unsupervised machine learning to figure out how the data groups together naturally, without any assumptions.

Unsupervised learning is when the model works on its own to discover the hidden structure of the data. The algorithm is not given any labels and is expected to make sense of the data by looking at patterns. This would be your cluster analyses or dimensionality reductions. 

Clustering methods are used when you want the algorithm to group similar examples together, such as in the creation of population affinity groups.
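
A minimal clustering sketch, assuming scikit-learn and hypothetical craniometric measurements, shows k-means finding groups with no labels provided:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical measurements from unlabeled individuals (illustrative values).
X = np.array([[130, 95], [132, 97], [128, 94],
              [145, 110], [147, 112], [144, 109]])

# Ask the algorithm to find two natural groups; no labels are given.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]
```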

Dimensionality reduction is often used in the preprocessing stage; the algorithm reduces the number of variables in the data while preserving as much of the information as possible. For example, fitting a line of best fit through a scatter plot and then treating all of the points as if they fell along that line, by projecting them onto it, reduces the dimensions of that data.
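
The sketch below mirrors that analogy with principal component analysis (PCA), one common dimensionality reduction technique, applied to hypothetical correlated measurements:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical 2-D scatter: two strongly correlated measurements.
x = rng.normal(0, 1, 100)
X = np.column_stack([x, 2 * x + rng.normal(0, 0.3, 100)])

# Reduce two dimensions to one: project every point onto the best-fit axis.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (100, 2) -> (100, 1)
print(pca.explained_variance_ratio_)    # fraction of information preserved
```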

Supervised learning is more accurate, though it requires more human intervention, while unsupervised learning is inferential and doesn't need human help, though it doesn't make predictions.

Methods are built on samples. Power analysis is a method to determine the minimum sample size for quantitative research. Smaller samples are more susceptible to the effects of outliers and violations of assumptions than larger samples. Having more diversity within your sample allows your model to train better, making it more valid and more generalizable.
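
As a sketch of what a power analysis looks like in practice, the following uses statsmodels to solve for the minimum sample size per group for a two-sample t-test, assuming a medium effect size (Cohen's d = 0.5), alpha of 0.05, and 80% power; these inputs are illustrative choices, not fixed standards:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the number of observations per group given the other parameters.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 per group under these assumptions
```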

When creating a model, you use training and testing data: you train your model on one portion of the sample and test it on another to get its error rate. Test error captures external validity by assessing how well the model performs on a sample that was not included in model training. Training error refers to how well the model performs on the training set.

To get the testing error, you can use a hold-out/independent test set, k-fold cross-validation, leave-one-out (LOO), or the bootstrap.

A hold-out or independent test set is the best validation method for the largest samples, because it requires some of your sample to be set aside and used only for model testing, never for training.
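
A minimal hold-out sketch, assuming scikit-learn and a synthetic stand-in dataset in place of a real skeletal sample:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice this would be your own sample.
X, y = make_classification(n_samples=500, random_state=0)

# Set aside 25% of the sample for testing only; it never touches training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("testing accuracy: ", model.score(X_test, y_test))  # external validity
```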

K-fold cross-validation is second best for larger sample sizes. In this method, the full set of training data is split into groups called folds. The model is tested the same number of times as there are folds, and each time a different fold is used as the testing set. This means that every group both tests and trains the model.
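
A minimal k-fold sketch under the same assumptions (scikit-learn, synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# 5 folds: the model is fit 5 times, each time holding out a different fold
# for testing, so every observation is used for both training and testing.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one test score per fold
print(scores.mean())  # overall estimate of model performance
```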

In LOO, one data point at a time is held out, the model is trained on the remaining data, and then the model is tested on that point.
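
The same pattern works for LOO, which scikit-learn exposes as a cross-validation splitter; again the data here is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)

# Each iteration trains on n-1 points and tests on the single held-out point.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(scores.mean())  # proportion of held-out points classified correctly
```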

Finally, bootstrapping is best for the smallest sample sizes. It relies on sampling with replacement: you draw data points out into a group, but every time you draw a data point it gets put back in the pool before you choose again. This way one data point can be represented multiple times, making new combinations of data.
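
A minimal bootstrap sketch, assuming scikit-learn's resample utility and a tiny made-up dataset, showing how one value can be drawn multiple times:

```python
import numpy as np
from sklearn.utils import resample

data = np.array([2, 4, 6, 8, 10])

# Draw bootstrap samples: same size as the original, sampled with
# replacement, so a single value can appear more than once per resample.
for seed in range(3):
    print(resample(data, replace=True, n_samples=len(data), random_state=seed))
```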

Models can be underfit, overfit, or good fits.

An underfit model has high training and test error, meaning the wrong model was probably used for the data type.

An overfit model will have low training error but high testing error, indicating that the model is overly optimistic for the sample and thus not generalizable. A good fit lies in the middle.
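
To see all three fits in one place, here is a sketch (assuming scikit-learn and synthetic data) that fits polynomials of increasing degree to noisy data; the low-degree model typically shows high training and test error (underfit), while the high-degree model shows low training error but higher test error (overfit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 60)  # noisy nonlinear data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 12):  # underfit, good fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train error {train_err:.3f}, "
          f"test error {test_err:.3f}")
```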