The term “big data” refers to datasets so large or complex that traditional data processing applications cannot handle them effectively. This complexity arises primarily from three defining features, commonly known as “the three Vs”:
Volume: Big data encompasses vast amounts of information, often exceeding terabytes or petabytes, making it impractical to store on conventional hard drives or even standard server configurations. To accommodate this data, organizations typically use distributed storage systems that spread data across multiple servers, each comprising numerous hard drives.
Velocity: This aspect addresses the speed at which data is generated and processed. In many instances, data flows in at a rapid pace from sources such as social media, online transactions, sensors, and more. Systems must therefore be capable of processing and responding to this incoming data in near real time, often within milliseconds, for the results to remain relevant and insightful.
Variety: Big data comes in many forms, including structured data (such as database tables), semi-structured data (such as JSON or XML), and unstructured data (such as text documents, images, and videos). This diversity complicates data integration and analysis, because conventional databases typically require a predefined schema of rows and columns.
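To make the contrast concrete, the following minimal Python sketch handles one record of each kind using only the standard library; the field names and example values are hypothetical.

```python
import csv
import json
import io

# Structured data: fixed rows and columns, schema known in advance.
structured = io.StringIO("id,amount,currency\n1,99.50,EUR\n2,14.00,USD\n")
rows = list(csv.DictReader(structured))          # [{'id': '1', 'amount': '99.50', ...}, ...]

# Semi-structured data: nested and self-describing; fields may vary per record.
semi_structured = '{"id": 3, "amount": 7.25, "tags": ["mobile", "refund"]}'
record = json.loads(semi_structured)             # dict containing a nested list

# Unstructured data: free text with no schema; structure must be inferred.
unstructured = "Customer reported a failed payment and asked for a refund."
tokens = unstructured.lower().split()            # even tokenization is an analysis choice

print(rows[0]["amount"], record["tags"], tokens[:3])
```

The structured record can be queried immediately, the semi-structured record needs its nesting handled case by case, and the unstructured text requires an analysis step before it carries any queryable structure at all.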
While the sheer volume of big data might seem like the most significant challenge, it is in fact the unstructured nature of the data that complicates analysis the most. Traditional database systems are ill-equipped for big data because they rely on structured formats and cannot scale efficiently across numerous servers. To derive actionable insights, advanced techniques such as machine learning and data mining are needed to identify patterns and relationships within the data.
Examples of big data can be found in many sectors: banks continuously monitor transaction streams, and surveillance systems collect massive flows of data. Because big data is often distributed across multiple servers, processing must also be decentralized. Distributing processing responsibility is challenging under conventional paradigms, since it requires synchronization to prevent data from being overwritten or lost.
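The sketch below makes the synchronization problem concrete. It is a deliberately simplified, single-machine stand-in for distributed workers: several threads aggregate partitions of hypothetical transaction data into a shared table, and the lock is what keeps concurrent read-modify-write cycles from overwriting each other. The worker, partition, and account names are illustrative.

```python
import threading
from collections import defaultdict

# Shared mutable state: the kind of structure that forces coordination.
totals = defaultdict(int)
lock = threading.Lock()

def worker(partition):
    """Aggregate one partition of the data into the shared totals."""
    for account, amount in partition:
        # Without the lock, two workers could read the same old total,
        # add to it independently, and one update would be silently lost.
        with lock:
            totals[account] += amount

# Hypothetical partitions, standing in for data spread across servers.
partitions = [
    [("acc-1", 100), ("acc-2", 50)],
    [("acc-1", 25), ("acc-3", 75)],
]

threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(dict(totals))   # {'acc-1': 125, 'acc-2': 50, 'acc-3': 75}
```

Every update has to pass through the lock, so the coordination cost grows with the number of workers; this is exactly the burden the next paragraph's approach tries to avoid.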
Functional programming emerges as a robust solution to the difficulties of distributed data processing. Programs written in functional languages avoid mutable state (i.e., functions have no side effects) and rely on immutable data structures. These characteristics, together with support for higher-order functions, make functional programming better suited to writing correct, efficient, distributed code than traditional procedural techniques.
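The following sketch, a simplified map/reduce-style aggregation written in a functional style in Python, hints at why this approach composes well: each function is pure, the input tuples are never mutated, and the per-partition results can be computed independently (in principle on different servers) and merged afterwards without locks. The data and function names are illustrative.

```python
from functools import reduce

# Immutable input: tuples of (account, amount), never modified in place.
partitions = (
    (("acc-1", 100), ("acc-2", 50)),
    (("acc-1", 25), ("acc-3", 75)),
)

def aggregate(partition):
    """Pure function: maps one partition to a dict of per-account sums."""
    result = {}
    for account, amount in partition:
        result[account] = result.get(account, 0) + amount
    return result

def merge(left, right):
    """Pure function: combines two partial results into a new dict."""
    return {k: left.get(k, 0) + right.get(k, 0) for k in left.keys() | right.keys()}

# Higher-order functions: map applies aggregate to every partition
# independently, and reduce merges the partial results into one answer.
partials = map(aggregate, partitions)
totals = reduce(merge, partials, {})

print(totals)   # {'acc-1': 125, 'acc-2': 50, 'acc-3': 75}
```

Because `aggregate` and `merge` share no state, no synchronization is needed between the per-partition computations; correctness depends only on the functions' inputs and outputs.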
Since big data does not conform to the familiar table-like format, alternative representation models have been developed. One such model is the fact-based model, in which each data point is recorded as an immutable fact. Every fact is timestamped, so new values can be stored without overwriting previous entries. This approach not only prevents data loss from human error but also simplifies storage, since new entries are simply appended rather than requiring complex indexing. An illustrative example is the color of a house stored as two facts: both remain in the dataset, and their timestamps identify the most recent color.
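A minimal sketch of the house-color example follows, assuming a plain list of timestamped fact tuples as the store; the entity name, attribute name, and dates are hypothetical.

```python
from datetime import datetime, timezone

# Each fact is an immutable (entity, attribute, value, timestamp) tuple.
# Facts are only ever appended; nothing is overwritten.
facts = [
    ("house-42", "color", "white", datetime(2019, 5, 1, tzinfo=timezone.utc)),
    ("house-42", "color", "blue",  datetime(2023, 8, 17, tzinfo=timezone.utc)),
]

def current_value(facts, entity, attribute):
    """Return the most recent value recorded for an entity's attribute."""
    matching = [f for f in facts if f[0] == entity and f[1] == attribute]
    return max(matching, key=lambda f: f[3])[2] if matching else None

print(current_value(facts, "house-42", "color"))   # blue: the later fact wins
# The earlier fact is still present, so the full history is preserved.
```

Querying the current state is a matter of picking the latest timestamp, while the older fact remains available for auditing or correcting mistakes.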
Another valuable representation is the graph schema, which describes data relationships visually through nodes and edges. Each node represents an entity and can carry its properties, while edges represent the relationships among entities, accompanied by descriptive labels. Unlike the fact-based model, which relies on timestamps, graph schemas often assume that each node reflects the latest information. Entity properties can alternatively be drawn as rectangles linked to their nodes with dashed lines, indicating that the rectangle holds a property of the entity rather than a relationship between entities.
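A minimal sketch of such a schema in plain Python dictionaries is shown below (a real deployment would more likely use a dedicated graph database, which the text does not specify); the node identifiers, properties, and edge label are hypothetical.

```python
# Nodes: each entity carries its own properties.
nodes = {
    "person-1": {"type": "Person", "name": "Alice"},
    "house-42": {"type": "House", "color": "blue"},
}

# Edges: labeled relationships between entities.
edges = [
    {"from": "person-1", "to": "house-42", "label": "owns"},
]

def relationships(entity):
    """List the labeled edges leaving a given node."""
    return [(e["label"], e["to"]) for e in edges if e["from"] == entity]

print(relationships("person-1"))   # [('owns', 'house-42')]
```

Here the node dictionaries hold the entities' properties, while the edge list holds only labeled connections, mirroring the distinction the schema draws between properties and relationships.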
This variety of methods for handling big data underscores the need for innovative approaches in data science and information technology.