Flashcards for Text Analytics and Mining, Big Data, and Social Analytics
Text Analytics
Information retrieval + Text Mining; also defined as information retrieval + info extraction + data mining + web mining
Text Mining
The semiautomated process of extracting patterns from large amounts of unstructured data sources.
Information Extraction
Identification of key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching.
Topic Tracking
Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
Categorization
Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
Clustering
An unsupervised process whereby objects are classified into “natural” groups called clusters.
Concept Linking
Connects related documents by identifying their shared concepts and helps users find information that they perhaps would not have found using traditional search methods.
Question Answering
Finding the best answer to a given question through knowledge-driven pattern matching.
Corpus
A large and structured set of texts (usually stored and processed electronically) prepared for conducting knowledge discovery.
Term
A single word or multi-word phrase extracted directly from the corpus of a specific domain by means of NLP methods.
Concepts
Features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology.
Stemming
The process of reducing inflected words to their stem (or base or root) form.
Synonyms
Syntactically different words (i.e., spelled differently) with identical or at least similar meanings.
Polysemy (Homonyms)
Syntactically identical words (i.e., spelled exactly the same) with different meanings.
Tokenizing
The assignment of meaning to blocks of text; a token is a categorized block of text in a sentence.
Term Dictionary
A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.
Morphology
A branch of the field of linguistics and a part of NLP that studies the internal structure of words (patterns of word formation within a language or across languages).
Term-by-Document Matrix (Occurrence Matrix)
A common representation schema of the frequency-based relationship between terms and documents in tabular format.
Singular Value Decomposition (Latent Semantic Indexing)
A dimensionality reduction method used to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies.
Bag-of-Words Model
Text is represented as a collection of words, disregarding the grammar or the order in which the words appear.
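A minimal sketch of the bag-of-words idea, assuming a simple lowercase, letters-only tokenizer (the function name `bag_of_words` and the sample sentences are illustrative, not from the source). Two sentences with the same words in a different order produce the same representation:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and grammar."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag_a = bag_of_words("The dog chased the cat")
bag_b = bag_of_words("The cat chased the dog")

# Word order differs, but the two bags are identical.
print(bag_a == bag_b)
print(bag_a["the"])
```

The meaning of the two sentences differs, which is exactly the information this model discards.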
Natural Language Processing (NLP)
Using a natural language processor to interface with a computer-based system; studies the problem of “understanding” the natural human language.
WordNet
A hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets; a major resource for NLP applications.
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources.
Voice of Customer (VOC)
Applications that focus on 'who and how' questions by gathering and reporting direct feedback from site visitors.
Voice of the Market (VOM)
About understanding aggregate opinions and trends; about knowing what stakeholders are saying.
Voice of the Employee (VOE)
Using rich, opinionated textual data as an effective and efficient way to listen to what employees are saying.
Sentiment Detection
The goal is to differentiate between a fact and an opinion, which may be viewed as classification of text as objective or subjective.
N-P Polarity Classification
Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities, or locate its position on the continuum between these two polarities.
Polarity Identification
The process of identifying negative or positive connotations in text (in sentiment analysis).
Web Mining
The process of discovering intrinsic relationships from Web data, which are expressed in textual, linkage, or usage information.
Web Content Mining
The extraction of useful information from Web pages.
Web Crawlers
Applications used to read through the content of a Web site automatically.
Authoritative Pages
Web pages that are identified as particularly popular based on links by other Web pages and directories.
Hub
One or more Web pages that provide a collection of links to authoritative pages.
Web Structure Mining
The process of extracting useful information from the links embedded in Web documents.
Search Engine
A software program that searches for documents based on the keywords users have provided.
Search Engine Optimization (SEO)
Techniques to improve a site’s visibility in unpaid (organic) search results.
White-Hat SEO
Following search engine guidelines; focuses on quality content for users.
Black-Hat SEO
Violating search engine rules; uses deception; risks penalties or site removal from search results.
Web Usage Mining
Is the extraction of useful information from data generated through Web page visits and transactions; also called Web analytics.
Clickstream Analysis
The analysis of data that occurs in the Web environment.
Off-Site Web Analytics
Web measurement and analysis about you and your products that takes place outside your Web site.
On-Site Web Analytics
On-site visitor measurement; measure visitors’ behavior once they are on your Web site.
Social Analytics
Mining the textual content created in social media and analyzing socially established networks for gaining insight about existing and potential customers.
Social Network
A social structure composed of individuals (or groups) linked to one another with some type of connections/relationships.
Homophily
The extent to which actors form ties with similar versus dissimilar others.
Multiplexity
The number of content forms contained in a tie.
Mutuality/Reciprocity
The extent to which two actors reciprocate each other’s friendship or other interaction.
Network Closure
A measure of the completeness of relational triads; an individual’s assumption of network closure is called transitivity.
Propinquity
The tendency for actors to have more ties with geographically close others.
Bridge
An individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters.
Centrality
Refers to a group of metrics that aim to quantify the importance or influence of a particular node (or group) within a network.
Density
The proportion of direct ties in a network relative to the total number possible.
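As a rough illustration, assuming an undirected network with no self-ties (so the number of possible ties among n actors is n(n-1)/2), density can be computed as follows. The function name `density` and the four-actor example are hypothetical:

```python
def density(nodes, ties):
    """Proportion of actual ties to possible ties in an undirected network."""
    n = len(nodes)
    possible = n * (n - 1) // 2  # every unordered pair of distinct actors
    return len(ties) / possible if possible else 0.0

nodes = {"A", "B", "C", "D"}
# Each tie is an unordered pair, so frozenset avoids counting A-B and B-A twice.
ties = {frozenset(pair) for pair in [("A", "B"), ("B", "C"), ("C", "D")]}

network_density = density(nodes, ties)
print(network_density)  # 3 ties out of 6 possible
```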
Distance
The minimum number of ties required to connect two particular actors.
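One common way to find this minimum is a breadth-first search. The sketch below assumes an undirected network stored as an adjacency dictionary; the names `distance` and `adjacency` are illustrative only:

```python
from collections import deque

def distance(adjacency, start, goal):
    """Minimum number of ties between two actors, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no path: the actors sit in separate components

adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),  # isolated actor
}

print(distance(adjacency, "A", "D"))
print(distance(adjacency, "A", "E"))
```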
Structural Holes
The absence of ties between two parts of a network; finding and exploiting can give an entrepreneur a competitive advantage.
Tie Strength
Defined by the linear combination of time, emotional intensity, intimacy, and reciprocity; strong ties are associated with homophily, propinquity, and transitivity, whereas weak ties are associated with bridges.
Cliques and Social Circles
Groups identified as cliques if every individual is directly tied to every other individual or social circles if there is less stringency of direct contact, which is imprecise, or as structurally cohesive blocks if precision is wanted.
Clustering Coefficient
A measure of the likelihood that two associates of a node are themselves associates; a higher value indicates greater cliquishness.
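A small sketch of the local version of this measure, assuming an undirected network: for one node, it is the share of pairs of that node's associates that are tied to each other. The function name and the example network are hypothetical:

```python
from itertools import combinations

def clustering_coefficient(adjacency, node):
    """Share of pairs of a node's associates that are themselves tied."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two associates: no pairs to check
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)

adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

# A has three associates; only the pair B-C is tied, so 1 of 3 pairs.
print(clustering_coefficient(adjacency, "A"))
# B's two associates (A and C) are tied, so the coefficient is 1.0.
print(clustering_coefficient(adjacency, "B"))
```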
Cohesion
The degree to which actors are connected directly to each other by cohesive bonds.
Social Media
Refers to the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks.
Descriptive Analytics
Uses simple statistics to identify activity characteristics and trends.
Social Network Analysis
Follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence.
Advanced Analytics
Includes predictive analytics and text analytics that examine the content in online conversations to identify themes, sentiments, and connections that would not be revealed by casual surveillance.
Big Data
Data characterized by volume, variety, and velocity that exceeds the reach of commonly used hardware environments and/or capabilities of software tools to process.
Veracity
Conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the data.
Variability
Data flows can be highly inconsistent with periodic peaks.
Value Proposition (of Big Data)
Big analytics means greater insight and better decisions, something that every organization needs.
In-Memory Analytics
Allows analytical computations and Big Data to be processed in-memory and distributed across a dedicated set of nodes.
In-Database Analytics
Perform data integration and analytic functions inside the database so you won’t have to move or convert data repeatedly.
Grid Computing
Process jobs in a shared, centrally managed pool of IT resources.
Appliance (In the context of Big Data)
Brings together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis.
MapReduce
A technique to distribute the processing of very large multi-structured data files across a large cluster of machines.
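The classic illustration is a word count. The sketch below runs the map, shuffle, and reduce steps in a single Python process purely to show the data flow; it is not Hadoop code, and in a real cluster each chunk would be mapped on a different machine. All function names are illustrative:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Each string stands in for a file block stored on a different node.
chunks = ["big data big insight", "data streams"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```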
Hadoop
An open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data.
Hadoop Distributed File System (HDFS)
A distributed file management system that lends itself well to processing large volumes of unstructured data.
Data Scientist
Manipulate and analyze data using tools for searching hidden insights and patterns, or use as the foundation for building user-facing analytic applications.
Name Node
The node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and if any nodes fail.
Secondary Node
A backup to the Name Node, it periodically replicates and stores data from the Name Node should it fail.
Job Tracker
The node in a Hadoop cluster that initiates and coordinates MapReduce jobs or the processing of the data.
Slave Nodes
The grunts of any Hadoop cluster, slave nodes store data and take direction to process it from the Job Tracker.
Hive
A Hadoop-based data warehousing framework developed by Facebook; allows users to write queries in an SQL-like language called HiveQL.
Pig
A Hadoop-based query language developed by Yahoo!
Teradata Aster
A Big Data platform for distributed storage and processing of large multi-structured data sets; used for marketing optimization, fraud detection, sports analytics, social networking analysis, machine data analytics, energy analytics, etc.
Stream Analytics
Analytic process of extracting actionable information from continuously flowing/streaming data.
Perpetual Analytics
An analytics practice that continuously evaluates every incoming data point against all prior observations to identify patterns/anomalies.
Critical Event Processing
Method of capturing, tracking, and analyzing streams of data to detect events of certain types that are worthy of the effort.
Data Stream Mining
The process of extracting novel patterns and knowledge structures from continuous, rapid data records.
Summarization
Summarizing a document to save time
Challenges associated with implementation of NLP
Part-of-speech tagging
Text segmentation
Word sense disambiguation
Syntactic ambiguity
Imperfect or irregular input
Speech Acts
Deception detection
Text mining applied to a large set of real-world criminal (person-of-interest) statements was used to develop prediction models that differentiate deceptive statements from truthful ones, using a rich set of cues extracted from the textual statements.
Part-of-Speech Tagging
The process of marking up the words in a text as corresponding to a particular part of speech such as nouns, verbs, adjectives, etc., to help with analysis and understanding of the text's structure.
Term-document matrix (TDM)
A frequency matrix created from digitized and organized documents (the corpus). Rows represent the documents and columns represent the terms. The relationships between the terms and documents are characterized by indices.
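A minimal sketch of building such a matrix, with documents as rows and terms as columns and raw term counts as the indices. The two-document corpus, the whitespace tokenizer, and the function name are assumptions made for illustration:

```python
from collections import Counter

docs = {
    "d1": "text mining extracts patterns from text",
    "d2": "web mining extracts information from web pages",
}

def term_document_matrix(docs):
    """Return (terms, matrix): one row per document, one column per term."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    terms = sorted(set().union(*counts.values()))  # column order
    matrix = [[counts[name][term] for term in terms] for name in docs]
    return terms, matrix

terms, matrix = term_document_matrix(docs)
print(terms)
for name, row in zip(docs, matrix):
    print(name, row)
```

Real corpora produce very wide, sparse matrices, which is why dimensionality reduction is usually applied afterward.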
Singular value decomposition (SVD)
Closely related to principal components analysis, it reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents).
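In symbols, the standard form of the decomposition is shown below. The notation is an assumption added for study purposes (it is not in the source): A is the documents-by-terms matrix with m documents and n terms, and k is the number of dimensions kept.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad k < r
```

Here U is m by r, V is n by r, and r is the rank of A. Keeping only the k largest singular values gives A_k, the best rank-k approximation of A in the least-squares (Frobenius norm) sense, which is the reduced representation used in latent semantic indexing.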
Classification
Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.
Scatter/gather
This document browsing method uses clustering to enhance the efficiency of human browsing of documents when a specific search query cannot be formulated. In a sense, the method dynamically generates a table of contents for the collection and adapts and modifies it in response to the user selection.
Query-specific clustering
This method employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less-similar documents, creating a spectrum of relevance levels among the documents. This method performs consistently well for document collections of realistically large sizes.
Trend Analysis
The notion that the various types of concept distributions are functions of document collections; that is, different collections lead to different concept distributions for the same set of concepts. It is, therefore, possible to compare two distributions that are otherwise identical except that they are from different subcollections.
Factors for Big Data Analytics
Clear business need (aligning with vision and strategy)
Strong, committed sponsorship (executive champion)
Alignment between business and IT strategy
Fact-based decision-making culture
Strong data infrastructure
Challenges of Big Data
Data volume: The ability to capture, store, and process a huge volume of data at an acceptable speed
Data integration: The ability to combine data that is not similar in structure or source and to do so quickly and at a reasonable cost.
Processing capabilities: The ability to process data quickly, as it is captured.
Data governance: The ability to keep up with the security, privacy, ownership, and quality issues of Big Data. Capabilities of governance practices should adapt
Skills availability: There is a shortage of people (often called data scientists) with the skills to use new tools and look at data in different ways.
Solution cost: To ensure a positive return on investment, it is crucial to reduce the cost of the solutions.