Data provisioning
the process of providing users and systems with access to data
Includes maintaining security authorizations to limit access
Provisioning from external sources requires obtaining permission from those sources to acquire and use the data, unless the data are open source (freely available)
During the provisioning process data may be
replicated or copied
Replication
Ensures that the source data remain intact
Can be performed in real time or in batches
It is essential to consider the impact of replication on the performance of the source systems so as to minimize disruption to the organization's business systems
One reason we replicate from the source to the analysis system is that the analysis could burden the source system and slow it down
Another reason is to transform the data into a more usable form before we analyze it
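A minimal sketch of batch replication, assuming two in-memory SQLite databases stand in for the source and analysis systems. The table names (sales, sales_copy), the column names, and the cents-to-dollars transformation are hypothetical and chosen only for illustration.

```python
import sqlite3


def replicate_in_batches(source, target, batch_size=2):
    """Copy rows from source.sales into target.sales_copy; the source is only read."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales_copy (id INTEGER PRIMARY KEY, amount_usd REAL)"
    )
    cursor = source.execute("SELECT id, amount_cents FROM sales ORDER BY id")
    copied = 0
    while True:
        # Fetching small batches limits how long each read burdens the source.
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        # Transform while copying: cents become dollars, a more usable form for analysis.
        rows = [(row_id, cents / 100) for row_id, cents in batch]
        target.executemany("INSERT INTO sales_copy VALUES (?, ?)", rows)
        copied += len(rows)
    target.commit()
    return copied


# Hypothetical source system with three transactions.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 1250), (2, 999), (3, 40000)])
source.commit()

# Hypothetical analysis system that receives the replicated data.
target = sqlite3.connect(":memory:")
copied = replicate_in_batches(source, target)
```

The source table is never written to, so the source data remain intact after the copy.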
Structured data:
Are computer readable, highly organized, and searchable
Data sourced from databases, spreadsheets, flat files, and other systems with fields, cells, rows, and columns are organized so that a computer can understand them.
Can be formatted as strings (text), numbers, or dates that a computer can read
T/F: The fact that acquired data are structured, however, does not mean they are ready to use for analysis
TRUE
Structured data frequently contain errors, redundancies, and omissions
Structured data are based on data models that contain _____
metadata (data about the data)
provide context, meaning and purpose to data
Unstructured data
Do not conform to data models and associated metadata
Computers have a harder time reading this data
Include text files, pictures, audio recordings, webpages, social media content, and videos
Spreadsheets
a widely used business tool for storing and manipulating data
Stored in rows and columns
Problem with storing data in spreadsheets: because protection of the data is limited, users can easily introduce errors into formulas if they are unfamiliar with how a spreadsheet works
Lack of input control as well as lack of access control
Flat file
contains data in text format with no structured relationship among the data or to other files
.csv, ASCII, other delimited files
They are more frequently used to transfer data from one location to another
Ex: we can download data from one database into a flat file and then use that flat file to upload the data to another database
System configuration files are often stored in flat files
Amazon SimpleDB
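A minimal sketch of the download-then-upload use of a flat file described above, assuming two in-memory SQLite databases and an in-memory .csv file. The customers table and its contents are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Download: write the table to a comma-delimited flat file (held in memory here).
flat_file = io.StringIO(newline="")
writer = csv.writer(flat_file)
writer.writerow(["id", "name"])
writer.writerows(src.execute("SELECT id, name FROM customers ORDER BY id"))

# Upload: read the flat file back and load it into a second database.
flat_file.seek(0)
reader = csv.DictReader(flat_file)
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
dst.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    # A flat file carries only text, so the id is converted back to an integer.
    [(int(row["id"]), row["name"]) for row in reader],
)
loaded = dst.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

The flat file itself holds no data types or relationships, which is why the loading step has to restore the integer type explicitly.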
Databases
Organized collections of data that enable users to access, manage, and update the data.
Most popular: Relational database (A collection of tables linked together via relationships)
Data model:
The structure of a database
In a relational database, the relationships between tables are created through the use of unique identifiers called _____
primary keys
Uniquely identifies each row in the table
Primary keys are referenced in other tables, where they are called
foreign keys
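A minimal sketch of a primary key and a foreign key, assuming an in-memory SQLite database. The customers and sales tables and their columns are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key: it uniquely identifies each customer row.
db.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
# In sales, customer_id is a foreign key that references the customers table.
db.execute(
    """CREATE TABLE sales (
           sale_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
           amount REAL)"""
)
db.execute("INSERT INTO customers VALUES (1, 'Ada')")
db.execute("INSERT INTO sales VALUES (100, 1, 25.0)")

# The key relationship lets the two tables be joined.
joined = db.execute(
    "SELECT c.name, s.amount FROM sales s JOIN customers c USING (customer_id)"
).fetchall()

# A sale that points at a nonexistent customer is rejected by the database.
try:
    db.execute("INSERT INTO sales VALUES (101, 999, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```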
Four types of interactions possible in a database
Create new records or rows (modify)
Read records (has no impact)
Update or change (modify)
Delete records (modify)
CRUD
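The four CRUD interactions can be sketched as follows, assuming an in-memory SQLite database and a hypothetical products table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# Create: add a new row (modifies the data).
db.execute("INSERT INTO products VALUES (1, 'Widget', 9.5)")

# Read: query the row (has no impact on the data).
after_create = db.execute("SELECT name, price FROM products WHERE id = 1").fetchone()

# Update: change an existing value (modifies the data).
db.execute("UPDATE products SET price = 11.0 WHERE id = 1")
after_update = db.execute("SELECT name, price FROM products WHERE id = 1").fetchone()

# Delete: remove the row (modifies the data).
db.execute("DELETE FROM products WHERE id = 1")
remaining = db.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```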
Anomalies
Irregularities that arise if the database is not structured properly
a serious problem because they threaten data integrity
update anomalies
occur when the same data are stored in multiple places and therefore may or may not update correctly when a data value changes
insert anomalies
result when there is no place within the table to store the new data until another event occurs
Ex: the only place to store the customer name and address is in the sales transaction table
delete anomalies
occur when deleting some data results in the unintentional deletion of other data
Ex: if we were to delete a customer who has not purchased from us for a long time, then their associated sales records would also be deleted
Normalization
the process of decomposing a database table into more tables until the database is no longer susceptible to modification anomalies
1NF, 2NF, 3NF, BCNF (Boyce-Codd normal form)
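A minimal sketch of why decomposition helps, assuming an in-memory SQLite database. The sales_flat, customers, and sales tables are hypothetical, and the example illustrates only an update anomaly, not a full walk through the normal forms.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Unnormalized: the customer's address is repeated on every sale row.
db.execute(
    "CREATE TABLE sales_flat (sale_id INTEGER PRIMARY KEY, customer TEXT, address TEXT, amount REAL)"
)
db.executemany(
    "INSERT INTO sales_flat VALUES (?, ?, ?, ?)",
    [(1, "Ada", "1 Old Rd", 20.0), (2, "Ada", "1 Old Rd", 35.0)],
)
# Updating only one copy leaves two conflicting addresses (an update anomaly).
db.execute("UPDATE sales_flat SET address = '9 New St' WHERE sale_id = 1")
flat_addresses = {
    row[0] for row in db.execute("SELECT address FROM sales_flat WHERE customer = 'Ada'")
}

# Decomposed: each customer fact is stored once and referenced by key.
db.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
db.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id), amount REAL)"
)
db.execute("INSERT INTO customers VALUES (1, 'Ada', '1 Old Rd')")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [(1, 1, 20.0), (2, 1, 35.0)])
# A single update now changes the address seen by every sale.
db.execute("UPDATE customers SET address = '9 New St' WHERE customer_id = 1")
normalized_addresses = {
    row[0]
    for row in db.execute(
        "SELECT c.address FROM sales s JOIN customers c ON c.customer_id = s.customer_id"
    )
}
```

In the decomposed design a new customer can also be stored before any sale exists, which is the situation behind the insert anomaly.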
Business systems (transactional systems) are typically normalized up to the _____ normal form
3rd
Most well known method for dealing with unstructured data
Tagging
tagged data
employ identifiers known as tags that are attached to the data elements to make them readable by a computer
Enclosed within <> brackets
Hypertext Markup Language (HTML)
uses tags to mark how content is structured within a webpage so that a web browser can process the tags and display the intended content
Extensible markup language (XML)
looks very similar to HTML, but it is used to describe data to both humans and computers.
a method of tagging or coding data in documents, so that they can be read by both people and computers.
T/F: Unstructured data may be un-understandable to a computer in its native form
TRUE
XML tags are used to
create metadata about data so that the data can be understood by computers for further processing and structuring.
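A minimal sketch of how tags let a computer pick out data elements, using Python's standard xml.etree.ElementTree module. The invoice document and its tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document: the tags act as metadata describing each value.
document = """
<invoice>
  <customer>Ada Lovelace</customer>
  <amount currency="USD">125.50</amount>
</invoice>
"""

root = ET.fromstring(document)

# The tags tell the program which value is which, so it can build a structured record.
record = {
    "customer": root.findtext("customer"),
    "amount": float(root.findtext("amount")),
    "currency": root.find("amount").get("currency"),
}
```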
extensible business reporting language (XBRL)
developed by accounting professionals to facilitate data sharing for reports and analysis
Basically, any activity that requires communicating unstructured data to a computer and that has a structured taxonomy of tags can use XBRL.
XBRL then must be stripped of its tags to be used in analysis
XBRL converts blocks of text into content that has meaning to a computer
Natural language processing (NLP)
People speak, and the computer translates the speech into commands that it can understand
EX: Python package NLTK
NLP can be employed by analysts to convert source data into machine-understandable data
Often considered a type of AI
Image recognition
Scans a picture and translates what it ‘sees’ into a textual description of whatever is depicted in the picture
transactional systems
store and process the business data required for each of the business's operations
designed to process transactions quickly, reliably, and accurately
Most transactional systems are based on an underlying relational database
Configured with a three-tiered architecture
Transactional systems are also called
online transaction processing (OLTP) systems
enable them to support high-volume business transactions
Transactional systems generally are configured to use a three-tiered architecture that consists of the following components:
The user interface or presentation tier (most users will see, use and understand the transactional system only via this interface)
The business services, business logic, or application tier (business logic tier, application tier, middle tier, logic tier)
Can be used to enforce data rules
The data services or data storage tier
Represent the layers of the application and are logical rather than physical
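The three tiers can be sketched as three small classes. This is only an illustration of the logical separation: the class names, the in-memory list standing in for a DBMS, and the quantity rule are all hypothetical.

```python
class DataTier:
    """Data services tier: stores and retrieves records (in-memory stand-in for a DBMS)."""

    def __init__(self):
        self._orders = []

    def save_order(self, order):
        self._orders.append(order)

    def all_orders(self):
        return list(self._orders)


class BusinessTier:
    """Business logic tier: enforces data rules before anything reaches storage."""

    def __init__(self, data_tier):
        self._data = data_tier

    def place_order(self, customer, quantity):
        # Hypothetical business rule: an order must be for at least one unit.
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._data.save_order({"customer": customer, "quantity": quantity})


class PresentationTier:
    """Presentation tier: the only part of the system most users interact with."""

    def __init__(self, business_tier):
        self._business = business_tier

    def submit(self, customer, quantity):
        try:
            self._business.place_order(customer, quantity)
            return "Order accepted"
        except ValueError as error:
            return f"Order rejected: {error}"


data = DataTier()
ui = PresentationTier(BusinessTier(data))
accepted = ui.submit("Ada", 3)
rejected = ui.submit("Ada", 0)
```

All three classes run in one process here, which matches the point that the tiers are logical layers rather than physical ones.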
Business rules
the logic by which business data are operated on
Include workflow, business processes, and user roles
Typically resides on a separate server machine
The Database management system (DBMS) resides at the
data services or data tier
Where data are stored and accessed by the business services tier
Characteristics of transactional systems
Availability
Level of detail
Updatable
Speed
Current
Operational
Concurrent
Support Requirements of business processes
Small uniform transactions
optimized for storage
Data are functionally or process oriented
Enterprise resource planning (ERP) systems
Integrated transactional systems that enable all the functional areas of a business to share data
benefits of an ERP:
Transactional data need to be entered only once and then can be shared across all pertinent areas
Changes made to master data are entered only once then used many times
This is not the case with non-integrated systems
The data processing and storage functionality of all the business processes are consolidated in a single system
Informational systems
are used to provide a place for data to be stored and prepared for analytical purposes
Users can access them to make data-driven decisions
optimized for read-only and therefore frequently separate from transactional system
Informational systems are sometimes referred to as
Online analytical processing (OLAP)
Contain large quantity of data that can be from multiple sources
Both data mining and analytics may be accomplished via OLAP
Characteristics of informational systems
level of detail
periodic
requirements are not always known
managerial requirements
Optimized for access
Historical data
data may be integrated
availability
Out-of-date computer systems are referred to as
legacy systems
Web service
an XML-based software system that enables users to access computing resources via a network
Simple object access protocol (SOAP)
Web services are application components that communicate via open protocols
Allow different systems to communicate with each other
Ex: SOAP enables a Windows-based system to share data with a Unix system
T/F: A key characteristic of web services is that they have no user interface
TRUE
Web Crawlers
Also known as info agents or web spiders
Search websites one page at a time for information
Typically ask permission before pulling data from site
Also used for web scraping
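One common way a crawler checks for permission is a site's robots.txt file. That choice of mechanism is an assumption here, since the notes do not name one. The sketch below uses Python's standard urllib.robotparser module, with hypothetical rules, URLs, and crawler name, and it parses the rules from a string rather than fetching them over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a site.
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Before pulling a page, the crawler asks whether it is allowed to fetch that URL.
can_fetch_public = parser.can_fetch("example-crawler", "https://example.com/products")
can_fetch_private = parser.can_fetch("example-crawler", "https://example.com/private/data")
```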
Web Scraping
process of searching for information on webpages and then stripping the html tags so the data can be stored in a structured format
Used for marketing purposes
Web scraping may be accomplished with the use of site-specific application programming interfaces (APIs)
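A minimal sketch of stripping HTML tags and keeping the data in a structured form, using Python's standard html.parser module. The page content and the TableCellExtractor class are hypothetical, and the sketch assumes a well-formed table in which every cell sits inside a row.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only text found inside a cell; the tags themselves are discarded.
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())


# Hypothetical page fragment.
page = (
    "<table>"
    "<tr><td>Widget</td><td>9.50</td></tr>"
    "<tr><td>Gadget</td><td>12.00</td></tr>"
    "</table>"
)

extractor = TableCellExtractor()
extractor.feed(page)
extractor.close()

# Store the scraped values in a structured format.
products = [{"name": name, "price": float(price)} for name, price in extractor.rows]
```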
Typically clickstream data are stored as:
Semi-structured data (contain both text and structured data stored automatically by the system)
Sensor data
gathered from devices such as heating units, vehicles, electrical transformers, satellites, airplanes, health monitors, etc
Have applications in many areas of life
Going to be ever more important as the IOT becomes more prevalent in our lives
Manufacturers use sensor data to monitor the health of their products
Sampling
the act of extracting only certain data values from a dataset (subset)
This approach is employed in situations where a sample of the dataset tells the same story as the entire dataset
Sampling is appropriate when
The analyst is certain that the sample is representative of the entire set
the source is too large for the planned analysis
The application specifically calls for a data sample (accounting)
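A minimal sketch of simple random sampling with Python's standard random module. The synthetic dataset, the seed, and the sample size of 500 are assumptions made for illustration. Other sampling designs exist and are not shown.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset of 10,000 transaction amounts.
population = [rng.gauss(100, 15) for _ in range(10_000)]

# Extract a subset without replacement; each value is equally likely to be chosen.
sample = rng.sample(population, 500)

# If the sample is representative, its summary statistics should sit close
# to those of the full dataset.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
```

Comparing sample_mean with population_mean is one quick check on whether the sample tells the same story as the entire dataset.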
Scaling
Standardizes data to a common scale (for example, z-scores with mean 0 and standard deviation 1)
Necessary when the output of the analysis needs to fall within a range of values
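Two common ways to scale data can be sketched as follows. The function names and sample values are hypothetical. Note that z-scoring changes the scale of the data but not the shape of its distribution, and that both functions assume the values are not all identical.

```python
import statistics


def z_scores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]


def min_max(values, low=0.0, high=1.0):
    """Rescale values linearly so they fall within the range [low, high]."""
    smallest, largest = min(values), max(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]


amounts = [10.0, 20.0, 30.0, 40.0, 50.0]
standardized = z_scores(amounts)
ranged = min_max(amounts)
```

Min-max scaling is the variant that fits the case where results must fall within a fixed range of values.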
Systems that examine events and transactions in real time are called
continuous audit modules or continuous audit tools
To collect data from the transactional system, an auditor will sometimes employ an
embedded audit module or EAM
examines transactional data to identify abnormalities
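A minimal sketch of the kind of check an embedded audit module might run. The rules, the approval limit, and the transaction fields are hypothetical, and a real module would run inside the transactional system rather than over a list.

```python
def flag_abnormal(transactions, approval_limit=10_000):
    """Return the transactions that break simple, hypothetical audit rules."""
    flagged = []
    for tx in transactions:
        reasons = []
        if tx["amount"] > approval_limit:
            reasons.append("over approval limit")
        if tx["amount"] <= 0:
            reasons.append("non-positive amount")
        if reasons:
            flagged.append({"id": tx["id"], "reasons": reasons})
    return flagged


# Hypothetical transactions passing through the system.
transactions = [
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": 12_500.0},
    {"id": 3, "amount": -40.0},
]
flagged = flag_abnormal(transactions)
```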
Intelligent control agents
software processes that work autonomously with distributed systems to control or run a system both with and without human intervention
Data may be collected automatically through:
Continuous monitoring, feedback mechanisms, control agents