Understanding Big Data: From Data to Big Data
Introduction
One of the main drivers behind the rise of big data is the growing availability of data, which has significantly changed how data is collected and analyzed. Key considerations now revolve around the volume, velocity, and variety of available data. Because data now pervades nearly every business activity, managers and decision-makers increasingly recognize its importance.
Data Defined
Data comprises facts such as numbers, words, measurements, observations, and descriptions that provide information about individuals, objects, or events.
The Oxford dictionary defines data as "the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media."
Data is considered a valuable asset, and its effective utilization can provide a significant competitive edge.
Types of Data in the Big Data Era
The big data landscape encompasses diverse data types, each requiring specific tools and techniques for analysis and processing.
Sources of Data
Databases: Databases serve as a primary source of data, supported by technologies such as SQL-based systems and Hadoop for data storage and retrieval. These databases contain various types of information, such as web server logs and banking transactions.
Raw Data: Raw data is unprocessed and often complex; it must be cleaned and transformed before it can be used to model a problem.
Text: Textual data includes books and articles written in natural language, as well as symbolic text such as HTML code and DNA sequences.
Images, Audio, and Video: These data types pose unique challenges for data scientists but are valuable sources of information.
Internet of Things (IoT): Connected devices with sensors generate large amounts of raw data.
The Data Revolution
Revolutions involve fundamental changes in technical systems that impact society. Digital information is predicted to increase tenfold every 5 years. Many large companies have datasets in the petabyte range, with potential growth to exabytes or zettabytes.
Data sizes are growing exponentially, driven by individuals sharing data online, companies collecting client information, and computer-controlled industrial and commercial processes.
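To make the "tenfold every 5 years" prediction concrete, the short Python sketch below projects a dataset's size under that assumption. The function names and the 1 PB starting size are illustrative choices, not figures from a specific study, and the model assumes smooth exponential growth.

```python
def projected_size(initial_bytes: float, years: float,
                   factor: float = 10.0, period_years: float = 5.0) -> float:
    """Size after `years` if data grows by `factor` every `period_years`.

    Assumes smooth exponential growth: size(t) = size(0) * factor ** (t / period).
    """
    return initial_bytes * factor ** (years / period_years)


def annual_growth_rate(factor: float = 10.0, period_years: float = 5.0) -> float:
    """Equivalent yearly growth rate (0.5 means +50% per year)."""
    return factor ** (1.0 / period_years) - 1.0


if __name__ == "__main__":
    one_petabyte = 10 ** 15  # hypothetical starting size, in bytes
    for years in (0, 5, 10, 15):
        size = projected_size(one_petabyte, years)
        print(f"after {years:>2} years: {size:.3e} bytes")
    print(f"equivalent annual growth: {annual_growth_rate():.1%}")
```

Under this assumption, a 1 PB dataset reaches 10 PB after 5 years and 1 EB after 15 years, which corresponds to growth of roughly 58% per year.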
Financial institutions, companies, healthcare providers, and administrations create vast amounts of data through interactions. Internet searches, social networks, GPS systems, and stock market transactions also contribute significantly.
Organizations that manage large datasets routinely use terms such as terabyte, petabyte, exabyte, zettabyte, and yottabyte.
Big Data Definition
While data has existed for a long time, its rapid production and diverse forms have led to the emergence of big data. Big data is created digitally and collected automatically.
There are six mechanisms for finding and using big data:
Using already collected data to improve existing processes.
Supporting new activities with existing data.
Building a business model based on big data resources (e.g., Amazon).
Federating data resources from multiple entities (e.g., hospital databases).
Collecting and organizing large amounts of data to benefit an organization and its clients.
Building big data resources from scratch when no prior data or technologies existed.
Data Units
1 byte = 8 bits
1 kilobyte (KB) = 10^3 bytes
1 megabyte (MB) = 10^6 bytes
1 gigabyte (GB) = 10^9 bytes
1 terabyte (TB) = 10^12 bytes
1 petabyte (PB) = 10^15 bytes
1 exabyte (EB) = 10^18 bytes
1 zettabyte (ZB) = 10^21 bytes
1 yottabyte (YB) = 10^24 bytes
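The decimal units above can be applied mechanically. The following Python sketch converts between bytes and the listed units; the helper names (to_bytes, to_bits, humanize) are hypothetical and chosen only for illustration.

```python
# Decimal (powers of 1000) units, in the order listed in the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value: int, unit: str) -> int:
    """Convert `value` expressed in `unit` (e.g. "PB") to bytes."""
    exponent = 3 * UNITS.index(unit.upper())
    return value * 10 ** exponent


def to_bits(num_bytes: int) -> int:
    """1 byte = 8 bits."""
    return num_bytes * 8


def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    index = 0
    while index < len(UNITS) - 1 and num_bytes >= 1000 ** (index + 1):
        index += 1
    return f"{num_bytes / 1000 ** index:.2f} {UNITS[index]}"


if __name__ == "__main__":
    print(to_bytes(3, "TB"))
    print(humanize(2_500_000_000))
    print(humanize(to_bytes(1, "ZB")))
```

This sketch follows the decimal convention of the table. Binary units such as the kibibyte (1 KiB = 1,024 bytes) are a separate convention and are not covered here.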
Big Data: The 3 Vs
The growth in data quantity, diversity, and access speed defines the 3 Vs of Big Data:
Volume: Refers to the size of the data.
Velocity: Refers to the data provisioning rate and the necessary response time.
Variety: Refers to the heterogeneity of data acquisition, representation, and semantic interpretation.
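As a rough illustration of how the three dimensions can be measured, the Python sketch below profiles a small batch of incoming records. The Record type, its fields, and the profile function are hypothetical. The sketch also simplifies two of the definitions above: velocity is reduced to the arrival rate (response time is not modeled), and variety is reduced to a count of distinct formats.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    timestamp: float  # arrival time, in seconds since observation started
    fmt: str          # data format, e.g. "json", "jpeg", "csv"
    size_bytes: int   # payload size in bytes


def profile(records: List[Record]) -> Dict[str, Optional[float]]:
    """Summarize a batch of records along the 3 Vs (simplified)."""
    if not records:
        return {"volume_bytes": 0, "velocity_bytes_per_s": None, "variety": 0}

    # Volume: total size of the data.
    volume = sum(r.size_bytes for r in records)

    # Velocity: average arrival rate over the observed time span.
    timestamps = [r.timestamp for r in records]
    span = max(timestamps) - min(timestamps)
    velocity = volume / span if span > 0 else None  # undefined for a zero span

    # Variety: number of distinct formats seen.
    variety = len({r.fmt for r in records})

    return {"volume_bytes": volume,
            "velocity_bytes_per_s": velocity,
            "variety": variety}


if __name__ == "__main__":
    batch = [
        Record(0.0, "json", 500),
        Record(5.0, "jpeg", 2000),
        Record(10.0, "json", 500),
    ]
    print(profile(batch))
```

A real system would typically track these quantities continuously and per source. The point of the sketch is only that each V corresponds to something measurable.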
Definitions of Big Data
Gartner's Definition: "An information asset whose volume is large, velocity is high, and formats are various."
McKinsey's Definition: "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." This definition is subjective and evolves with technology, varying by sector based on available software tools and common dataset sizes.
By McKinsey's estimate at the time of this definition, big data in many sectors ranged from a few dozen terabytes to multiple petabytes.
Elaborating on the 3Vs Model
The "3Vs" describe the characteristics of data that call for new technologies and new approaches to analysis.
Volume: Big data involves enormous data volumes. Current discussions focus on exabytes or zettabytes, a vast increase over the megabytes stored on floppy disks a few decades ago. The IoT, with its ubiquitous sensors, contributes significantly to the expanding digital universe.
Velocity: Velocity refers to the speed at which data is created, stored, and analyzed. Because data is now transmitted almost instantly, companies must react, and anticipate, more quickly.
Variety: Variety encompasses the different types and sources of data, including emails, photos, videos, monitoring devices, PDFs, and audio. This diversity poses technological challenges: all of these types must be stored, mined, and analyzed, whatever their format, and each calls for its own analysis techniques.
The Significant Use of Big Data
The value of big data lies in identifying useful data and transforming it into usable information through pattern recognition and new algorithms, tools, and solutions.
Big data represents a revolution in data analysis, enabling the processing and analysis of all data types in their original form, integrating new methods and ways of working.
The 3 Vs fundamentally change how data is addressed, placing it at the center of transformation.
Transforming data into information and then into knowledge is crucial for business success.