Geospatial Analysis of Social Media

Introduction

  • Geospatial analysis deals with geographic references in blogs, tweets, comments, images, videos, news feeds, online games, and social media.
  • Social media platforms track user locations implicitly, recording where people go and search for information.
  • Users voluntarily share travel logs, destination locations, hotels, restaurants, and location-aware images.
  • Citizens are now sensors actively collecting and contributing geospatial information (Goodchild, 2007).
  • Crowd-sourced datasets reduce data collection burden (Stefanidis et al., 2013) and offer opportunities to study:
    • Human movements (De Longueville et al., 2009)
    • Segregation (Hedayatifar et al., 2019; Morales et al., 2019)
    • Polarization of opinions (Himelboim et al., 2013; Yardi and boyd, 2010)
    • Distribution of languages (Mocanu et al., 2013)
    • Vectors of political campaigns (Jungherr, 2016)
  • Crowd-sourced datasets also help estimate population densities and predict the geospatial contagion of diseases (Zhang, 2014).
  • This chapter gives insights into using volunteered geospatial information and analyzing human activities and information flows.
  • The authors use a multidisciplinary approach, integrating techniques from geography, visual analytics, social sciences, information science, complex systems, and data science.
  • Geographic analyses are integrated with semantic analysis, machine learning, and network analysis to enhance the understanding of spatio-temporal contexts.
  • Geospatial analyses offer extra dimensions for triangulation methodologies, capturing different aspects of phenomena.
  • Researchers use geographic locations to understand places themselves, to gain insights into other phenomena, and to model communities, economic impacts, political impacts, communication patterns, and emergencies.

Background Information

  • Geospatial references vary in types, meanings, sources, formats, accuracy, precision, and certainty/uncertainty.
  • These features affect the quality, validity, credibility, and trustworthiness of spatial analyses.
  • Geographic locations are not always the primary focus, but rather spatial behaviors of people, objects, relationships, events, communities, and activities.
  • Geospatial analysis should consider properties of people, communities, events, and relationships alongside geospatial locations.
  • Relationships exist between different types of locations, e.g., home and work locations, places of interest (SafeGraph).
  • SafeGraph's datasets describe social distancing during COVID-19, measured by relationships between location types.
    • Home and workplaces are differentiated by the amount of time spent at each location and at what time of the day.
  • Studies show that geospatial locations are not always shared by users.
    • Java et al. (2007) reported that 52% of Twitter users included in their study had geospatial locations.
    • Hecht et al. (2011) observed that two out of three users in their study had information about their geospatial locations.
    • Cheng et al. (2010) reported that 5% of users in their study listed locations in the form of coordinates, and 21% reported locations at the city level.
  • Locational information can be inferred from user-posted content (Backstrom et al., 2010; Cheng et al., 2010; Popescu and Grefenstette, 2010).
  • MacEachren et al. (2011b) studied Twitter use for crisis management and reported low geolocation usage.
  • Stefanidis et al. (2013) suggested that variations in location reporting can be attributed to an uneven distribution of the latest mobile devices.
  • Geographic references denote users’ home locations, events, and spatial coverage of microblog entries or images.
  • Geographic references may refer to static or dynamic states and events.
    • Home locations (static).
    • Tweets (ambient, momentary social hotspots or events).
  • Spatial information in social media is recorded in two forms:
    • Geospatial footprints: spatial location expressed in coordinates (Hill, 2006).
      • Can be a dot, line, polygon, set of dots, boundary box, image, or pixels.
      • Required for mapping references.
    • Text: references to place names.
      • Expressed in different languages, codes, tags, mailing addresses, postal codes, time zones, or IP addresses.
      • Assigning footprints is made easy by geocoding services, but accuracy can vary.
      • Geocoding services might fail to recognize or match location names due to changes, transliterations, homonyms, and variant spellings.
  • Feature types are natural and cultural categories of geospatial locations (Hill, 2006).
    • Natural features: continents, mountains, lakes, seas, forests, grasslands.
    • Cultural features: types of businesses, man-made constructions, and places.
    • Improve accuracy and precision for information retrieval and help interpret context.
  • Social media platforms offer location check-in services (e.g., Facebook) that provide places of interest.
  • Check-in data allows studying frequencies of check-ins and can build models to predict next check-in locations and improve advertisement targeting (e.g. Chang and Sun, 2011).
  • Accuracy: the degree to which the recorded value represents the 'correct' value (Hill, 2006: 227).
    • Studies show error varies in urban environments from 7 to 13 meters (Merry and Bettinger, 2019).
      • Correlated with time of day, high WiFi usage, and presence of buildings.
    • IP geocoding can locate a city or country (Gharaibeh et al., 2017; Livadariu et al., 2020).
    • Reported IP geocoding accuracy varies by country (MaxMind, 2020), e.g., Luxembourg (8%), Algeria (28%), USA (68%), Canada (55%).
  • Precision refers to the potential amount of geographic extent represented by the locality.
    • Coordinates are more precise and accurate than textual references.
    • Textual references are translated into coordinates, commonly represented as a central point or bounding box.
  • Scale is the ratio between linear distance on the map and corresponding distance on Earth's surface (Longley et al., 2005).
  • Geographic references form a semantic space of their own (Buchel, 2013; Fabrikant, 2001a, 2001b; Fabrikant and Buttenfield, 2001; Fabrikant and Skupin, 2005).
    • Defined by implicit relationships among geographic references, like countries and provinces, forming a semantic hierarchy.
  • Accuracy, precision, and scale affect spatial uncertainty of geospatial representations.
  • Uncertainty is the difference between a real geographic phenomenon and the user's understanding of it (Longley et al., 2005).
  • Semantic uncertainty arises when social media users give different meanings to the same term, phrase, or action, leading to false conclusions.
  • Using feature types can reduce semantic uncertainty and increase geospatial accuracy and precision (Han et al., 2012).

Analysis Roadmap

  • Geospatial analysis involves understanding how microblogs or social media artifacts relate to other data or phenomena.
  • It includes measuring geospatial footprints of online and offline communities, determining their volume and proximity, identifying spatial clusters, locating hot spots, and making predictions about outcomes of events (ESRI, 2020).
  • Requires interdisciplinary skills, including semantic, network, retrieval, and statistical analyses, machine learning and AI, and programming skills.
  • The Geospatial Analysis pipeline mirrors the data science pipeline:
    • Research question.
    • Data collection.
    • Data preparation (geospatial coordinates assignment, accuracy, aggregation).
    • Exploratory, confirmatory, and predictive analyses.
    • Presentation.
  • Confirmatory and predictive analysis can lead to more hypothesis generation, triggering additional testing.

Research Questions

  • Geospatial analysis answers multidisciplinary research questions.
    • Forecasting political opinions (Sobkowicz et al., 2012).
    • Identifying and mapping global virtual communities (Hedayatifar et al., 2019; Morales et al., 2019; Stefanidis et al., 2013).
    • Making meteorological observations (Hyvärinen and Saltikoff, 2010).
    • Studying structure, dynamics, and rhythms of natural cities (Jiang and Miao, 2015; Morales et al., 2017).
    • Making observations about street networks (Boeing, 2017).
    • Tracking infectious diseases (Padmanabhan et al., 2013).
    • Managing crisis situations (MacEachren et al., 2011a).
    • Capturing human movement patterns across borders (Blanford et al., 2015).
    • Discovering significant events and patterns (Andrienko et al., 2010).
    • Understanding protest movements (Gleason, 2013).
    • Finding geographic patterns and correlations in communication networks and languages (Conover et al., 2013; Morales et al., 2019; Mocanu et al., 2013).
    • Fine-tuning communication or marketing strategies (Bhattacharya et al., 2019).
  • Researchers use maps to:
    • Report findings.
    • Verify social media reliability.
    • Discover new patterns and insights.
    • Generate hypotheses.
    • Understand laws about movements.
  • GIScience community techniques include identifying clusters, classifications, predictive modeling, and exploratory analysis.

Sampling

  • Studies reviewed use statistical sampling techniques in semantic, geospatial, and temporal contexts.
  • Semantic sampling involves semantic filtering via queries.
    • Purcell and de Beurs (2013) used textual queries for weather data but faced noisy tweets and homonyms.
    • Noise, multiple variables, and uncertainty may hinder exploration/hypothesis generation and decision-making.
  • Query expansion techniques from information science and NLP can improve samples.
    • Using a thesaurus to expand queries with synonyms and related words.
    • Term weighting based on distance from query words.
    • Massoudi et al. (2011) suggested quality indicators (emoticons, post length, shouting, capitalization, hyperlinks, reposts, followers, tweet recency) to model microblog post retrieval.
    • Contextual searches include event detection (Sayyadi et al., 2009) and mining opinions (Sobkowicz et al., 2012).
  • Hashtags are used for tweet retrieval; analysts should collect all relevant hashtags adopted by platform users.
  • Microblog samples are not necessarily representative of populations.
    • Populations can be topic/relationship-bound or unbound, requiring different collection techniques (Rafail, 2018).
    • Hashtag samples may have narrower coverage than relationship-bound samples (Morstatter et al., 2014), misrepresenting activity and leading to erroneous conclusions (Rafail, 2018).
  • Geospatial sampling narrows samples to locations (radius/bounding box) or stratifies samples.
    • Stefanidis et al.'s (2013) study retrieved tweets within a 10-kilometer radius from Tahrir Square.
    • Stratified designs parse geographic areas into census blocks, tracts, or grids for random sampling (Eckman and Himelein, 2020).
  • Time is a crucial sampling parameter.
    • Different sampling periods give insights into spatio-temporal population dynamics of a city (Oki, 2018).
    • Space dynamics change with citizen use patterns but are stable over weeks or months.
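
The semantic and geospatial filters described above can be combined in a few lines. The following is a minimal sketch, not the procedure of any cited study: the thesaurus, the posts, and the function names are hypothetical, and the Tahrir Square coordinates are approximate. It expands a query with related words and then keeps only posts within a 10-kilometer radius, using the haversine great-circle distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def expand_query(terms, thesaurus):
    """Add synonyms and related words from a thesaurus dict to the query."""
    expanded = {t.lower() for t in terms}
    for term in list(expanded):
        expanded.update(w.lower() for w in thesaurus.get(term, ()))
    return expanded


def sample_posts(posts, terms, center, radius_km):
    """Keep posts that mention any query term AND lie within the radius."""
    lat0, lon0 = center
    kept = []
    for post in posts:
        words = set(post["text"].lower().split())
        near = haversine_km(lat0, lon0, post["lat"], post["lon"]) <= radius_km
        if words & terms and near:
            kept.append(post)
    return kept


# Hypothetical inputs, for illustration only.
THESAURUS = {"protest": ["demonstration", "rally"]}
TAHRIR = (30.0444, 31.2357)  # approximate coordinates
POSTS = [
    {"text": "Huge rally downtown tonight", "lat": 30.05, "lon": 31.24},
    {"text": "Protest news from afar", "lat": 31.2, "lon": 29.9},
    {"text": "Lovely coffee this morning", "lat": 30.04, "lon": 31.23},
]

terms = expand_query(["protest"], THESAURUS)
sample = sample_posts(POSTS, terms, TAHRIR, radius_km=10)
```

Whitespace tokenization is a simplification: it ignores punctuation, hashtags, and homonyms, which are exactly the sources of noise noted above.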

Data Pre-processing

  • Extracting geographic place names (geoparsing) from short comments and tweets differs from extracting them from grammatically correct texts.
  • Locations in microblog posts may lack capitalization and proper spelling (e.g., British Columbia versus bc).
  • Gelernter and Mushegian's (2011) analysis showed named entity recognition software fails when names are not capitalized, have non-standard abbreviations, or have misspellings.
  • Karimzadeh et al. (2013) claimed GeoTxt API solves this issue, but whether it relies on feature types for disambiguation is unclear.
  • Geotagging is adding geographical identification metadata (coordinates, altitude, heading direction, place names) to resources (websites, RSS feeds, images, videos) (Torniai et al., 2007: 159).
  • Newer cameras have GPS receivers while users add semantic geotags (e.g., Paris).
  • Rorissa et al. (2012) extracted geotags from Flickr images, finding no statistical significance with levels of abstraction but suggesting ways to help users find photographs online.
  • O'Hare and Murdock (2013) used tags and geotags to predict photo locations with varying accuracy.
  • Kipp et al. (2014) extracted location information from image descriptions, titles, and commenter accounts on Flickr, facing difficulties with capitalization in titles.
  • Forward geocoding: assigning coordinates for addresses.
    • Services: Google Maps, Bing Maps, Foursquare Fsq.io, MapQuest.
  • Reverse geocoding: mapping coordinates to addresses and toponyms; it can be complicated by uncertainty.
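
To illustrate why lowercase and abbreviated place names are a problem for geoparsing, here is a small sketch of a case-insensitive gazetteer lookup. It is not the GeoTxt API or any named entity recognition tool: the gazetteer entries, the approximate coordinates, and the function name are all hypothetical.

```python
import re

# Hypothetical mini-gazetteer: lowercase variant -> (canonical name, lat, lon).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "british columbia": ("British Columbia", 53.73, -127.65),
    "bc": ("British Columbia", 53.73, -127.65),
    "paris": ("Paris", 48.8566, 2.3522),
}


def geoparse(text, gazetteer=GAZETTEER):
    """Return gazetteer entries whose name variants occur in the text.

    Matching is case-insensitive and limited to whole words, so "bc"
    matches but "abc" does not.
    """
    lowered = text.lower()
    found = []
    # Try longer variants first so full names are matched before abbreviations.
    for variant in sorted(gazetteer, key=len, reverse=True):
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            entry = gazetteer[variant]
            if entry not in found:
                found.append(entry)
    return found


hits = geoparse("wildfire smoke all over bc today")
```

The sketch does no disambiguation: "bc" can also be shorthand for "because", and "Paris" names more than one place, which is where feature types and context become necessary.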

Spatial Aggregation

  • Spatial information is reported by administrative entities (countries, counties/provinces, census tracts/blocks) due to ease of understanding.
  • Shortcomings of administrative areas:
    • Inability to show exact boundaries of specific activities (social ones may occupy a small part of the entity).
    • The number of geospatial entities is fixed and low.
    • For regression analysis, the number may not be enough and can lead to Modifiable Areal Unit Problem (Openshaw, 1983).
  • Spatial aggregation methods like buffers, Voronoi diagrams, and geohashing overcome limitations and facilitate the privacy of people whose spatial activities are analyzed (Keßler and McKenzie, 2018).

Spatio-Temporal Hashing

  • Hashing is a computer science technique using hash tables to map keys to values.
  • Other techniques: R-tree (Beckmann et al., 1990; Guttman, 1984), Hilbert R-tree (Kamel and Faloutsos, 1993), and Quadtree (Samet, 1984).
  • Hash table is an abstract data type used for efficient element additions and deletions.
  • Spatio-temporal maps use geohashing (coordinates into discrete symbols) and time hashing (temporal aspects).
  • Geohash index partitions the globe into a grid of fixed-size squares, each representing a bounding box (Liu et al., 2014).
  • Geohash function returns a string for a bounding box.
  • The geohash scheme is recursive: boxes are subdivided and labeled at each level.
  • In Python, pygeohash (McGinnis, 2015) can be used, where a five-character index indicates a bounding box roughly 4,900 meters wide and 4,900 meters high (e.g., coordinates (42.6, −5.6) correspond to geohash 'ezs42' at precision level 5).
  • Timehash converts time into a 12-digit representation:
    • Two digits each for year, month, day, hour, minute, and second (YYMMDDHHMMSS).
    • More precise timehash = smaller timebox.
    • A 60-second window corresponds to 10 digits (YYMMDDHHMM).
  • Hashing offers computational efficiency for processing large data amounts.
  • Hashing reduces data, enhancing pattern discovery in Big Data.
  • Hashing supports the discovery of co-traveling and co-communicating entities, because such co-occurrence may indicate an event of interest to users.
  • Hashing is critical for privacy protection.
  • Geohashing explores multiple scales of phenomena (e.g., inequalities).
  • Multi-scale geohashing and time hashing can be found in ElasticSearch and Kepler.gl.
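
The geohash and timehash ideas above can be sketched without external packages. The code below is an illustrative stand-alone implementation of the standard geohash encoding (it does not call pygeohash) together with the YYMMDDHHMMSS timehash described in this section; the function names and the combined key format are assumptions made for the example.

```python
from datetime import datetime

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=5):
    """Encode a coordinate as a geohash string by recursive bisection."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def timehash(moment, digits=12):
    """YYMMDDHHMMSS string, truncated to `digits` (10 digits = one minute)."""
    return moment.strftime("%y%m%d%H%M%S")[:digits]


def spacetime_key(lat, lon, moment, precision=5, digits=10):
    """Hypothetical combined key: posts sharing it fall in one box and minute."""
    return geohash_encode(lat, lon, precision) + ":" + timehash(moment, digits)


key_a = spacetime_key(42.60, -5.60, datetime(2020, 3, 15, 14, 30, 5))
key_b = spacetime_key(42.61, -5.59, datetime(2020, 3, 15, 14, 30, 50))
```

Because the scheme is recursive, a shorter geohash is a prefix of a longer one for the same point, so coarser scales can be obtained by truncating the string. Records that share a key (as `key_a` and `key_b` do here) can be grouped with an ordinary dictionary, which is one way co-located, co-timed activity might be detected.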

Augmenting Spatial Data

  • Social media lacks sociodemographic data.
  • Census data can infer ethnicity, income, and education based on locations.
  • Census properties are associated with grids, census blocks, or administrative entities.
  • Researchers can merge geospatial data with survey results, OpenStreetMap, real estate prices, and climate datasets.
  • Urban open-source data and Python packages (OSMnx and NetworkX) augment locations with information about street centrality, accessibility, and walkability (Boeing, 2017, 2019).
  • They also describe temporal properties.
  • Additional descriptors may answer questions about path dependencies.

Geospatial Analysis Methods

  • The analysis aims to gain spatial data insights and find correlations/models of microblogger behavior.
  • Models are mathematical representations of data in geographic space.
  • Techniques for analyzing spatial heterogeneity, distribution, neighbors, and behaviors.
  • These use locations of spatial neighbors, characteristics of relationships, spatial buffers, and other descriptors.
  • Methods
    • Geographically weighted summaries.
    • Density maps.
    • Heat maps.
    • Space-time cube models.
    • Flow maps.
    • Voronoi diagrams.
    • Spatial deviational ellipses.
    • Classification of spatial behaviors with machine learning methods.

Exploratory Analysis

  • Kepler.gl is an online tool for mapping.
  • Large tweet sets with geospatial coordinates can be represented with different exploratory methods (Figure 19.2): (a) dots, (b) grid maps, (c) heat maps, and (d) 3D hex bin maps.
  • Grids transform dots into a matrix for analysis and modeling. Kepler.gl is suitable if datasets aren't large.
  • Heat maps aggregate spatial buffers.
    • Buffering designates activity areas around objects of interest.
      • In Figure 19.2c, the buffer radius is set to 10 km. The heat map shows a collection of buffers that represent the collective behaviors of microbloggers.
    • Buffers link social media users' activities to the social fabric of cities.
    • Social scientists might link the radius of buffers to bloggers’ spatial influences.
    • Areas of influence have been analyzed for different businesses.
    • Catchment areas might be inferred.
  • Hex bins can also be created showing accumulations, by encoding counts with color and height.
  • Kepler.gl is an example of a geovisualization tool that supports contextual and spatio-temporal filtering. In addition to revealing geospatial patterns, this enables powerful question answering and insight generation in real time.
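
Outside an interactive tool, the step of turning dots into a grid of counts can be approximated in a few lines. This is a simplified sketch with square latitude/longitude cells rather than hex bins; the function name, the cell size, and the sample points are hypothetical.

```python
import math
from collections import Counter


def grid_counts(points, cell_deg):
    """Count (lat, lon) points per square grid cell of `cell_deg` degrees.

    Cells are indexed by (row, col), counted from latitude -90 and
    longitude -180.
    """
    counts = Counter()
    for lat, lon in points:
        row = math.floor((lat + 90.0) / cell_deg)
        col = math.floor((lon + 180.0) / cell_deg)
        counts[(row, col)] += 1
    return counts


# Hypothetical tweet coordinates: three near each other, one far away.
TWEETS = [
    (40.71, -74.10),
    (40.72, -74.20),
    (40.60, -74.05),
    (34.05, -118.24),
]

counts = grid_counts(TWEETS, cell_deg=0.5)
busiest_cell, busiest_count = counts.most_common(1)[0]
```

The counts per cell are what a grid or hex bin map would encode with color and height. Cells defined in degrees shrink in ground area toward the poles, so an equal-area projection is a better basis for comparing densities across latitudes.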

GeoTime, Space-Time Cube Model, and Flowmap.Blue

  • GeoTime is a commercial geovisualization tool capable of detecting geo-temporal patterns and integrating narration in analytical processes.
    • Figure 19.3 shows a snapshot from a GeoTime story with the log of activities in banks, car dealerships, and other places.
  • GeoTime brought to life the famous space-time model first envisioned by Hägerstrand in 1970.
  • Kepler.gl and flowmap.blue facilitate analysis of static and dynamic geospatial relationships.
    • Figure 19.4 shows a map of COVID-19 spread via air travel in flowmap.blue.

Exploration of Spatial Heterogeneity

  • Methods aggregate data into equal cells.
  • Quantities may not show patterns at first glance.
  • Geographers consider average values of neighbors using geographically weighted summaries or nearest neighbor analysis (Brunsdon et al., 1996).
  • The moving window technique smooths areas with uneven values showing a more even pattern of the area.
  • Analysts see a picture of means, standard deviations, and skewness computed for specific locations.
  • Analysts may use geographically weighted summary statistics to determine whether the density of phenomena observed spatially is homogenous or heterogeneous.
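
The moving window idea can be shown on a small grid of cell counts. The sketch below computes an unweighted local mean and standard deviation for each cell from its neighborhood. It is a simplification: geographically weighted summaries normally apply a distance-decay kernel, which is omitted here, and the function name and grid values are hypothetical.

```python
import statistics


def window_stats(grid, radius=1):
    """Local mean and population standard deviation for every grid cell.

    Each cell is summarized from the (2*radius + 1)-wide square window
    around it, clipped at the grid edges.
    """
    rows, cols = len(grid), len(grid[0])
    means = [[0.0] * cols for _ in range(rows)]
    stds = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                grid[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            means[r][c] = sum(window) / len(window)
            stds[r][c] = statistics.pstdev(window)
    return means, stds


# Hypothetical counts: one busy cell surrounded by empty ones.
GRID = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]

means, stds = window_stats(GRID, radius=1)
```

A single spike is spread over its neighbors in the smoothed surface, while the local standard deviation stays high around it, which hints at heterogeneity rather than an evenly dense area.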

Voronoi Diagrams

  • A Voronoi diagram is a grouping technique that divides a map into regions that need not coincide with administrative geographic regions.
  • Implies that every point inside a polygon is closer to that polygon's generating point than to any other generating point.
  • Used to find the largest empty circle amid a set of points.
  • Applications in astronomy, business analytics, and soil analysis.
  • John Snow used them for geospatial analysis.
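
The defining property of a Voronoi diagram, that each location belongs to its nearest generating point, can be sketched without constructing polygon geometry. The example assumes planar (projected) coordinates; the seed and point values and the function names are hypothetical.

```python
import math


def nearest_seed(point, seeds):
    """Index of the seed closest to `point` (Euclidean distance)."""
    return min(range(len(seeds)), key=lambda i: math.dist(point, seeds[i]))


def voronoi_assign(points, seeds):
    """Group points by the Voronoi cell (nearest seed) they fall into."""
    cells = {i: [] for i in range(len(seeds))}
    for point in points:
        cells[nearest_seed(point, seeds)].append(point)
    return cells


# Hypothetical planar coordinates, e.g. meters in a projected system.
SEEDS = [(0, 0), (10, 0)]
POINTS = [(1, 1), (9, -1), (4, 0), (6, 5)]

cells = voronoi_assign(POINTS, SEEDS)
```

Assigning posts to their nearest facility in this way gives the same grouping a drawn Voronoi diagram would, which is often all that an aggregation step needs.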

Standard Deviational Ellipses

  • Activity areas do not always fit a circle, grid, or hex bin; they are often better described with standard deviational ellipses.
  • Introduced by Lefever (1926).
  • Shows the orientation of distribution.
  • Shows distribution concentration/dispersion around its center: major axis shows dispersion orientation, minor axis shows minimum dispersion, and ellipse area indicates spread (Gong, 2002).
  • Used for collective and individual behavior.
  • Ellipses can be predictive tools.
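
One common way to obtain the orientation and axis lengths of such an ellipse is from the covariance of the point coordinates. The sketch below uses that covariance-based formulation, which may differ in detail from Lefever's original definition; it assumes planar coordinates, and the function name and sample points are hypothetical.

```python
import math


def deviational_ellipse(points):
    """One-standard-deviation ellipse of planar (x, y) points.

    Returns the mean center, the major-axis angle in degrees
    (counterclockwise from the x-axis), the semi-axis lengths, and the area.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n

    # Principal-axis angle and eigenvalues of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    major = math.sqrt(half_trace + spread)
    minor = math.sqrt(max(0.0, half_trace - spread))  # guard rounding error
    return {
        "center": (cx, cy),
        "angle_deg": math.degrees(theta),
        "major": major,
        "minor": minor,
        "area": math.pi * major * minor,
    }


# Hypothetical activity points lying along a diagonal street.
ellipse = deviational_ellipse([(0, 0), (1, 1), (2, 2), (3, 3)])
```

For points strung along a diagonal, the major axis points at 45 degrees and the minor axis collapses to zero, matching the reading of the major axis as the orientation of dispersion.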

Machine Learning Techniques for Classifying Human Behaviors

  • PCA and k-means clustering simplify a large number of attributes.
  • PCA transforms values into components, each explaining a variance percentage.
  • Coupled with k-means, to define classes.
  • Color-coding classes on the map reveals patterns.
    • Classification shows which classes play the most important role in each area.
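
The clustering half of this approach can be sketched in plain Python. The example below is a minimal k-means with a deterministic start; the PCA step is omitted, so the input rows stand in for component scores per area. The data, the initialization rule, and the function name are assumptions made for illustration.

```python
import math


def kmeans(rows, k, iterations=20):
    """Minimal k-means: returns (labels, centers) for numeric rows.

    Centers start at the first k rows, a deterministic choice made for
    this sketch; library implementations use smarter initialization.
    """
    centers = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(row, centers[c]))
            for row in rows
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical attribute vectors, one per map area.
AREAS = [
    (1.0, 1.0), (9.0, 9.0), (1.2, 0.8),
    (8.8, 9.1), (0.9, 1.1), (9.2, 8.9),
]

labels, centers = kmeans(AREAS, k=2)
```

Each label can then be mapped to a color, so that areas in the same class share a color on the map.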

Confirmatory/Predictive Models

  • Models study complex relationships among variables.

Correlations and Predictive Models

  • Finding correlations between social media and demographic data is a popular line of research (Nguyen et al., 2017; Vydiswaran et al., 2018, 2020).
  • Geographically weighted regression (GWR) is a spatial analysis technique modeling local relationships (Brunsdon et al., 1996).
    • GWR constructs a separate OLS regression equation for every location; outputs on a map highlight local variations in R^2 and variables.
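
The idea of fitting a separate regression at every location can be illustrated with a single predictor. This is a simplified sketch, not a full GWR implementation: it uses a Gaussian distance kernel with a fixed bandwidth, planar coordinates, and hypothetical data and names.

```python
import math


def local_fit(observations, at, bandwidth):
    """Weighted least-squares line fitted around location `at`.

    observations: list of (x_coord, y_coord, predictor, response).
    Observations closer to `at` receive larger Gaussian weights.
    Returns (intercept, slope).
    """
    weights = [
        math.exp(-0.5 * (math.dist((px, py), at) / bandwidth) ** 2)
        for px, py, _, _ in observations
    ]
    total = sum(weights)
    x_mean = sum(w * o[2] for w, o in zip(weights, observations)) / total
    y_mean = sum(w * o[3] for w, o in zip(weights, observations)) / total
    sxx = sum(w * (o[2] - x_mean) ** 2 for w, o in zip(weights, observations))
    sxy = sum(
        w * (o[2] - x_mean) * (o[3] - y_mean)
        for w, o in zip(weights, observations)
    )
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Hypothetical data: the response grows 1x the predictor in the west
# and 3x in the east.
WEST = [(0, 0, 1, 1), (1, 0, 2, 2), (2, 0, 3, 3), (0, 1, 4, 4)]
EAST = [(100, 0, 1, 3), (101, 0, 2, 6), (102, 0, 3, 9), (100, 1, 4, 12)]
OBSERVATIONS = WEST + EAST

west_fit = local_fit(OBSERVATIONS, at=(1, 0), bandwidth=5)
east_fit = local_fit(OBSERVATIONS, at=(101, 0), bandwidth=5)
```

A single global regression would average the two regimes, whereas the local fits recover a slope near 1 in the west and near 3 in the east. Mapping such local coefficients is what reveals spatial variation in a relationship.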

Geo-Social Visual Analytics

  • Integrates social network perspectives and methods into approach and tools for scientific reasoning (Luo and MacEachren, 2014: 29).
  • Aims to bridge geography, social science, and network analysis.
  • Luo et al. (2011) designed the GeoSocial App, which reveals how groups identified in networks are positioned in geographic space over time. It captures the dynamics of social relationships, which are hard to understand from static graphs.
  • Koylu et al.'s method takes into account distance, time (duration of interaction between individuals), and type of social relationship between each pair of individuals, which are represented on a kernel density map (see Figure 19.10).

Conclusion

  • Geospatial analysis of social media is complex.
  • Requires an interdisciplinary approach grounded in geospatial, linguistic, information retrieval, complexity, and data science theories.
  • Requires knowledge in data aggregation techniques for multiple scales, data pre-processing, and enrichment methods.
  • Essential to know methods for analyzing behaviors in space and spatial heterogeneities.
  • Users embedded in social and geospatial networks are influenced by both, converging or diverging in opinions.
  • The resulting patterns can be observed as segregation, balkanization, and polarization, both online and in geographic space.
  • Geospatial analyses are beneficial for studying inequalities, vulnerabilities, risks, and segregation patterns.
  • Modern maps are tools for studying dynamics, leading to a better understanding of systemic risks in societies, economics, and health.