Text Analytics, Big Data, and Social Analytics Flashcards

Description and Tags

Flashcards for Text Analytics and Mining, Big Data, and Social Analytics

98 Terms

1
New cards

Text Analytics

Information retrieval + text mining; also defined as information retrieval + information extraction + data mining + Web mining.

2
New cards

Text Mining

The semiautomated process of extracting patterns from large amounts of unstructured data sources.

3
New cards

Information Extraction

Identification of key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching.

4
New cards

Topic Tracking

Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.

5
New cards

Categorization

Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.

6
New cards

Clustering

An unsupervised process whereby objects are classified into “natural” groups called clusters.

7
New cards

Concept Linking

Connects related documents by identifying their shared concepts and helps users find information that they perhaps would not have found using traditional search methods.

8
New cards

Question Answering

Finding the best answer to a given question through knowledge-driven pattern matching.

9
New cards

Corpus

A large and structured set of texts (usually stored and processed electronically) prepared for conducting knowledge discovery.

10
New cards

Term

A single word or multi-word phrase extracted directly from the corpus of a specific domain by means of NLP methods.

11
New cards

Concepts

Features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology.

12
New cards

Stemming

The process of reducing inflected words to their stem (or base or root) form.
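
As a rough illustration of the idea, the sketch below strips a few common English suffixes. It is a toy rule set, not the Porter algorithm or any library's stemmer; the function name `naive_stem` and the suffix list are assumptions made for this example.

```python
# Toy suffix-stripping stemmer (illustrative only; real stemmers apply
# many more rules and conditions than this assumed, tiny rule set).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Return the word with the first matching suffix removed."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["walking", "walked", "walks", "walk"]]
print(stems)
```

All four inflected forms reduce to the same stem here, which is what lets a text mining tool count them as one term.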

13
New cards

Synonyms

Syntactically different words (i.e., spelled differently) with identical or at least similar meanings.

14
New cards

Polysemy (Homonyms)

Syntactically identical words (i.e., spelled exactly the same) with different meanings.

15
New cards

Tokenizing

The assignment of meaning to blocks of text; a token is a categorized block of text in a sentence.

16
New cards

Term Dictionary

A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.

17
New cards

Morphology

A branch of the field of linguistics and a part of NLP that studies the internal structure of words (patterns of word formation within a language or across languages).

18
New cards

Term-by-Document Matrix (Occurrence Matrix)

A common representation schema of the frequency-based relationship between terms and documents in tabular format.

19
New cards

Singular Value Decomposition (Latent Semantic Indexing)

A dimensionality reduction method used to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies.

20
New cards

Bag-of-Words Model

Text is represented as a collection of words, disregarding the grammar or the order in which the words appear.
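
A minimal sketch of this representation, assuming whitespace tokenization and lowercasing (both simplifications chosen for the example; `bag_of_words` is a hypothetical helper name):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word occurrences, ignoring word order and grammar."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The sentences mean different things, but their bags are identical
# because only the words and their counts are kept.
print(a == b)
```

The example also shows the model's main limitation: any meaning carried by word order is lost.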

21
New cards

Natural Language Processing (NLP)

Using a natural language processor to interface with a computer-based system; studies the problem of “understanding” the natural human language.

22
New cards

WordNet

A hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets; a major resource for NLP applications.

23
New cards

Sentiment Analysis

A technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources.

24
New cards

Voice of Customer (VOC)

Applications that focus on 'who and how' questions by gathering and reporting direct feedback from site visitors.

25
New cards

Voice of the Market (VOM)

About understanding aggregate opinions and trends; about knowing what stakeholders are saying.

26
New cards

Voice of the Employee (VOE)

Using rich, opinionated textual data as an effective and efficient way to listen to what employees are saying.

27
New cards

Sentiment Detection

The goal is to differentiate between a fact and an opinion, which may be viewed as classification of text as objective or subjective.

28
New cards

N-P Polarity Classification

Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities, or locate its position on the continuum between these two polarities.

29
New cards

Polarity Identification

The process of identifying negative or positive connotations in text (in sentiment analysis).

30
New cards

Web Mining

The process of discovering intrinsic relationships from Web data, which are expressed in textual, linkage, or usage information.

31
New cards

Web Content Mining

The extraction of useful information from Web pages.

32
New cards

Web Crawlers

Applications used to read through the content of a Web site automatically.

33
New cards

Authoritative Pages

Web pages that are identified as particularly popular based on links by other Web pages and directories.

34
New cards

Hub

One or more Web pages that provide a collection of links to authoritative pages.

35
New cards

Web Structure Mining

The process of extracting useful information from the links embedded in Web documents.

36
New cards

Search Engine

A software program that searches for documents based on the keywords users have provided.

37
New cards

Search Engine Optimization (SEO)

Techniques to improve a site’s visibility in unpaid (organic) search results.

38
New cards

White-Hat SEO

Following search engine guidelines; focuses on quality content for users.

39
New cards

Black-Hat SEO

Violating search engine rules; uses deception; risks penalties or site removal from search results.

40
New cards

Web Usage Mining

The extraction of useful information from data generated through Web page visits and transactions; also called Web analytics.

41
New cards

Clickstream Analysis

The analysis of data that occurs in the Web environment.

42
New cards

Off-Site Web Analytics

Web measurement and analysis about you and your products that takes place outside your Web site.

43
New cards

On-Site Web Analytics

On-site visitor measurement; measure visitors’ behavior once they are on your Web site.

44
New cards

Social Analytics

Mining the textual content created in social media and analyzing socially established networks for gaining insight about existing and potential customers.

45
New cards

Social Network

A social structure composed of individuals (or groups) linked to one another with some type of connections/relationships.

46
New cards

Homophily

The extent to which actors form ties with similar versus dissimilar others.

47
New cards

Multiplexity

The number of content forms contained in a tie.

48
New cards

Mutuality/Reciprocity

The extent to which two actors reciprocate each other’s friendship or other interaction.

49
New cards

Network Closure

A measure of the completeness of relational triads; an individual’s assumption of network closure is called transitivity.

50
New cards

Propinquity

The tendency for actors to have more ties with geographically close others.

51
New cards

Bridge

An individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters.

52
New cards

Centrality

Refers to a group of metrics that aim to quantify the importance or influence of a particular node (or group) within a network.

53
New cards

Density

The proportion of direct ties in a network relative to the total number possible.
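
A small sketch of the calculation, assuming an undirected network in which n actors can form at most n(n-1)/2 ties (the `density` function and the sample ties are made up for illustration):

```python
def density(num_nodes: int, ties: set) -> float:
    """Observed ties divided by all possible ties in an undirected network."""
    possible = num_nodes * (num_nodes - 1) // 2
    return len(ties) / possible if possible else 0.0

# Four actors connected in a chain: A-B, B-C, C-D.
ties = {frozenset(pair) for pair in [("A", "B"), ("B", "C"), ("C", "D")]}
d = density(4, ties)
print(d)  # 3 observed ties out of 6 possible
```

For a directed network the denominator would be n(n-1) instead, since each pair can be tied in both directions.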

54
New cards

Distance

The minimum number of ties required to connect two particular actors.
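
One common way to compute this is a breadth-first search over the network. The sketch below assumes an unweighted network stored as an adjacency dictionary; the `distance` function and the sample graph are illustrative, not taken from the flashcards.

```python
from collections import deque

def distance(graph: dict, start: str, goal: str):
    """Minimum number of ties between start and goal, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None  # no path connects the two actors

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
print(distance(graph, "A", "D"))
print(distance(graph, "A", "E"))
```

Breadth-first search visits actors in order of increasing hop count, so the first time the goal is reached the hop count is the minimum.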

55
New cards

Structural Holes

The absence of ties between two parts of a network; finding and exploiting a structural hole can give an entrepreneur a competitive advantage.

56
New cards

Tie Strength

Defined by the linear combination of time, emotional intensity, intimacy, and reciprocity; strong ties are associated with homophily, propinquity, and transitivity, whereas weak ties are associated with bridges.

57
New cards

Cliques and Social Circles

Groups are identified as cliques if every individual is directly tied to every other individual, as social circles if there is less stringency of direct contact (which is imprecise), or as structurally cohesive blocks if precision is wanted.

58
New cards

Clustering Coefficient

A measure of the likelihood that two associates of a node are themselves associates; a higher value indicates greater cliquishness.
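
A sketch of the usual local version of this measure for one node: the share of pairs of the node's neighbors that are themselves tied. It assumes an undirected network stored as a dictionary of neighbor sets; `local_clustering` and the sample graph are hypothetical.

```python
from itertools import combinations

def local_clustering(graph: dict, node: str) -> float:
    """Fraction of neighbor pairs of `node` that are directly tied."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors means no pairs to check
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return linked / (k * (k - 1) / 2)

graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(local_clustering(graph, "A"))  # only the B-C pair is tied, out of 3 pairs
print(local_clustering(graph, "B"))  # B's two neighbors are tied to each other
```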

59
New cards

Cohesion

The degree to which actors are connected directly to each other by cohesive bonds.

60
New cards

Social Media

Refers to the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks.

61
New cards

Descriptive Analytics

Uses simple statistics to identify activity characteristics and trends.

62
New cards

Social Network Analysis

Follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence.

63
New cards

Advanced Analytics

Includes predictive analytics and text analytics that examine the content in online conversations to identify themes, sentiments, and connections that would not be revealed by casual surveillance.

64
New cards

Big Data

Data characterized by volume, variety, and velocity that exceeds the reach of commonly used hardware environments and/or capabilities of software tools to process.

65
New cards

Veracity

Conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the data.

66
New cards

Variability

Data flows can be highly inconsistent with periodic peaks.

67
New cards

Value Proposition (of Big Data)

Big analytics means greater insight and better decisions, something that every organization needs.

68
New cards

In-Memory Analytics

Allows analytical computations and Big Data to be processed in-memory and distributed across a dedicated set of nodes.

69
New cards

In-Database Analytics

Perform data integration and analytic functions inside the database so you won’t have to move or convert data repeatedly.

70
New cards

Grid Computing

Process jobs in a shared, centrally managed pool of IT resources.

71
New cards

Appliance (In the context of Big Data)

Brings together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis.

72
New cards

MapReduce

A technique to distribute the processing of very large multistructured data files across a large cluster of machines.
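
The classic word-count example gives a feel for the pattern. The sketch below runs in a single Python process and only imitates the map, shuffle, and reduce steps; it does not use Hadoop or any real MapReduce API, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Each chunk stands in for a block of a file held on a different machine.
chunks = ["big data big insight", "data beats opinion"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the machines that hold each chunk, which is why the map function must not depend on any other chunk.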

73
New cards

Hadoop

An open-source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data.

74
New cards

Hadoop Distributed File System (HDFS)

A distributed file management system that lends itself well to processing large volumes of unstructured data.

75
New cards

Data Scientist

A professional who manipulates and analyzes data with tools to find hidden insights and patterns, or uses the data as the foundation for building user-facing analytic applications.

76
New cards

Name Node

The node in a Hadoop cluster that tells the client where in the cluster particular data is stored and whether any nodes have failed.

77
New cards

Secondary Node

A backup to the Name Node; it periodically replicates and stores data from the Name Node in case the Name Node fails.

78
New cards

Job Tracker

The node in a Hadoop cluster that initiates and coordinates MapReduce jobs or the processing of the data.

79
New cards

Slave Nodes

The grunts of any Hadoop cluster, slave nodes store data and take direction to process it from the Job Tracker.

80
New cards

Hive

A Hadoop-based data warehousing framework developed by Facebook; it allows users to write queries in an SQL-like language called HiveQL.

81
New cards

Pig

A Hadoop-based query language developed by Yahoo!

82
New cards

Teradata Aster

A Big Data platform for distributed storage and processing of large multistructured data sets; used for marketing optimization, fraud detection, sports analytics, social networking analysis, machine data analytics, energy analytics, etc.

83
New cards

Stream Analytics

Analytic process of extracting actionable information from continuously flowing/streaming data.

84
New cards

Perpetual Analytics

An analytics practice that continuously evaluates every incoming data point against all prior observations to identify patterns/anomalies.

85
New cards

Critical Event Processing

Method of capturing, tracking, and analyzing streams of data to detect events of certain types that are worthy of the effort.

86
New cards

Data Stream Mining

The process of extracting novel patterns and knowledge structures from continuous, rapid data records.

87
New cards

Summarization

Summarizing a document to save the reader time.

88
New cards

Challenges associated with implementation of NLP

  1. Part-of-speech tagging

  2. Text segmentation

  3. Word sense disambiguation

  4. Syntactic ambiguity

  5. Imperfect or irregular input

  6. Speech acts

89
New cards

Deception detection

Applying text mining to a large set of real-world criminal (person-of-interest) statements to develop prediction models that differentiate deceptive statements from truthful ones, using a rich set of cues extracted from the textual statements.

90
New cards

Part-of-Speech Tagging

The process of marking up the words in a text as corresponding to a particular part of speech such as nouns, verbs, adjectives, etc., to help with analysis and understanding of the text's structure.

91
New cards

Term-document matrix (TDM)

A frequency matrix created from digitized and organized documents (the corpus). Rows represent the documents and columns represent the terms. The relationships between the terms and documents are characterized by indices.
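
Following the convention in this card (documents as rows, terms as columns), a frequency matrix can be sketched as below. Raw occurrence counts are used as the index, and tokenization is plain whitespace splitting; both are simplifying assumptions, and `term_document_matrix` is a hypothetical helper.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix): one row per document, one column per term."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))  # shared, ordered vocabulary
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

docs = ["data mining finds patterns", "text mining mines text"]
terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny corpus, which is why such matrices are usually reduced or reweighted before analysis.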

92
New cards

Singular value decomposition (SVD)

Closely related to principal components analysis, it reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents).
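
In standard linear algebra notation (the symbols are assumptions introduced here, not part of the card), with A the m-by-n matrix of documents by terms, the decomposition and its rank-k truncation can be written as:

```latex
A = U \Sigma V^{\top},
\qquad
A_k = U_k \Sigma_k V_k^{\top},
\quad k \ll \min(m, n)
```

Here Σ is a diagonal matrix of singular values sorted in decreasing order, and keeping only the first k of them (with the matching columns of U and V) gives the lower dimensional representation.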

93
New cards

Classification

Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.

94
New cards

Scatter/gather

This document browsing method uses clustering to enhance the efficiency of human browsing of documents when a specific search query cannot be formulated. In a sense, the method dynamically generates a table of contents for the collection and adapts and modifies it in response to the user selection.

95
New cards

Query-specific clustering

This method employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less-similar documents, creating a spectrum of relevance levels among the documents. This method performs consistently well for document collections of realistically large sizes.

96
New cards

Trend Analysis

Based on the notion that the various types of concept distributions are functions of document collections; that is, different collections lead to different concept distributions for the same set of concepts. It is, therefore, possible to compare two distributions that are otherwise identical except that they are from different subcollections.

97
New cards

Success Factors for Big Data Analytics

  1. Clear business need (aligning with vision and strategy)

  2. Strong, committed sponsorship (executive champion)

  3. Alignment between business and IT strategy

  4. Fact-based decision-making culture

  5. Strong data infrastructure

98
New cards

Challenges of Big Data

  1. Data volume: The ability to capture, store, and process a huge volume of data at an acceptable speed.

  2. Data integration: The ability to combine data that is not similar in structure or source and to do so quickly and at a reasonable cost.

  3. Processing capabilities: The ability to process data quickly, as it is captured.

  4. Data governance: The ability to keep up with the security, privacy, ownership, and quality issues of Big Data. Capabilities of governance practices should adapt

  5. Skills availability: The shortage of people (often called data scientists) with the skills to use new tools and look at data in different ways.

  6. Solution cost: To ensure a positive return on investment, it is crucial to reduce the cost of the solutions.