Data Science Notes

Introduction to Data Science

  • Definition: Data science is a multidisciplinary field using statistical and computational methods to extract insights and knowledge from data.

  • It combines skills from statistics, computer science, mathematics, and domain expertise.

  • Data science involves data gathering, analysis, and decision-making.

  • It focuses on finding patterns in data through analysis to make future predictions.

    Companies can leverage data science for:

    • Better decisions.
    • Predictive analysis.
    • Pattern discoveries.

Where is Data Science Needed?

  • Banking

  • Consultancy

  • Healthcare

  • Manufacturing

    Examples:

    • Route planning for shipping.
    • Foreseeing delays for transportation (flights, ships, trains).
    • Creating promotional offers.
    • Optimizing delivery times.
    • Forecasting revenue.
    • Analyzing health benefits of training.
    • Predicting election outcomes.

    Data science can be applied in various business sectors:

    • Consumer goods
    • Stock markets
    • Industry
    • Politics
    • Logistics companies
    • E-commerce

Evolution of Data Science

  • 1962: John W. Tukey envisioned the field in "The Future of Data Analysis."
  • 1974: Peter Naur defined data science as "The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences."
  • 1977: The International Association for Statistical Computing (IASC) was established to link statistical methodology, computer technology, and domain expertise.
  • 1980s-1990s: Emergence of Knowledge Discovery in Databases (KDD) workshops and the International Federation of Classification Societies (IFCS).
  • 1994: Business Week reported on "Database Marketing," highlighting how companies collect and leverage data for business insights, while also noting the challenge of managing massive amounts of data.
  • 1990s-early 2000s: Data science emerges as a recognized field with academic journals and proponents like Jeff Wu and William S. Cleveland.
  • 2000s: Technology advances provide near-universal internet access and data collection.
  • 2005: The era of Big Data begins. Google and Facebook generate large amounts of data, necessitating new processing technologies.
  • Hadoop, Spark, and Cassandra emerge as technologies for storing and processing data at this scale.
  • 2014: Increasing demand for data scientists globally.
  • 2015: Machine learning, deep learning, and AI enter data science, driving innovations like personalized experiences and self-driving vehicles.
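  • 2018: New data-protection regulations, such as the EU's General Data Protection Regulation (GDPR), change how the field collects and handles data.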
  • 2020s: Further AI and ML breakthroughs, with increasing demand for data science professionals.

Data Science Roles

Data Analyst

  • Entry-level position focused on collecting and analyzing company data to provide actionable business insights.

    • Responsibilities: Accessing and cleaning data, performing statistical analysis, visualizing and communicating results.
    • Programming languages: Python, R, SQL.
    • Tools/skills: Data science programming, probability and statistics, collaboration, communication.

Data Scientist

  • Builds machine learning models and works with algorithms to make predictions based on data.

    • Responsibilities: Analyzing data, building and training machine learning models.
    • Programming languages: Python, R.
    • Tools/skills: Skills of a data analyst, plus strong math, analytics, computer science, machine learning methods, statistical models, advanced data science programming, and Apache Spark.

Business Analyst

  • Translates data insights into actionable business strategies.

    • Responsibilities: Communicating initiatives using data-driven insights, acting as a liaison between business and tech teams.
    • Languages/tools: SQL, Tableau.
    • Tools/skills: Understanding business processes, data visualization tools, listening and storytelling, data modeling.

Software Engineer

  • Optimizes product features based on user data or builds custom software.

    • Responsibilities: Collaborating with data scientists and business analysts to align business objectives, ensuring scalability and security.
    • Programming languages: Java, Python, C, C++.
    • Tools/skills: Machine learning and deep learning frameworks, mathematics (linear algebra and statistics), programming and debugging, data processing, writing and communication.

Marketing Data Scientist

  • Analyzes data to inform marketing strategies and measure campaign outcomes.

    • Responsibilities: Strategizing the launch and evolution of marketing campaigns, communicating between stakeholders.
    • Languages/tools: SQL, Python, R, Tableau.
    • Tools/skills: Data analytics, objective thinking, strong communication and adaptability.

Machine Learning Engineer

  • Applies machine learning algorithms to datasets.

    • Responsibilities: Processing data using machine learning algorithms to drive business decisions.
    • Programming languages: R, Java, Python, C++.
    • Tools/skills: Communication, data structures, vectors, matrices, derivatives, integrals, statistical concepts and probability theory.

Stages in a Data Science Project

  • Problem Statement: Clearly define the problem to add value to the business.
  • Data Collection: Research and gather necessary data from various sources (structured or unstructured).
  • Data Cleaning: Handle missing values and remove redundant, unnecessary, and duplicate data.
  • Data Analysis and Exploration: Analyze data structure, find hidden patterns, study behaviors, and visualize variable effects.
  • Data Modelling: Choose a best-fit algorithm (e.g., regression, classification such as SVM, or clustering), train it on training data, and evaluate it on test data (e.g., with k-fold cross-validation).
  • Optimization and Deployment: Test model accuracy, optimize for better prediction, and deploy for external use, gathering feedback for further improvement.
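
The cleaning and modelling stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production workflow: the helper names (clean, k_fold_indices, mean_absolute_error_cv), the toy records, and the "predict the training mean" model are all assumptions made for the example. In practice a library such as pandas or scikit-learn would usually handle these steps.

```python
import random

def clean(rows):
    """Drop rows with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # missing value: dropped here (imputing is another option)
        if row in seen:
            continue  # exact duplicate of an earlier row
        seen.add(row)
        cleaned.append(row)
    return cleaned

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def mean_absolute_error_cv(rows, k=3):
    """Score a trivial 'predict the training mean' model with k-fold CV."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train_mean = sum(rows[j][1] for j in train_idx) / len(train_idx)
        fold_error = sum(abs(rows[j][1] - train_mean) for j in test_idx) / len(test_idx)
        errors.append(fold_error)
    return sum(errors) / len(errors)

# Hypothetical (feature, target) records with one duplicate and one missing value.
raw = [(1, 10.0), (2, 12.0), (2, 12.0), (3, None),
       (4, 15.0), (5, 11.0), (6, 14.0), (7, 13.0)]
data = clean(raw)
print(len(data))  # → 6
print(mean_absolute_error_cv(data, k=3))
```

Averaging the error over all k folds gives a steadier estimate than a single train/test split, because every record is used for testing exactly once.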

Applications of Data Science

  1. Search Engines: Used to provide faster and more relevant search results.

    • Example: Displaying topmost visited web links for a given search query.
  2. Transport: Applied in driverless cars to reduce accidents.

    • Example: Analyzing data on speed limits, traffic conditions, and driving situations.
  3. Finance: Automates risk analysis and predicts future trends.

    • Example: Predicting customer lifetime value and stock market movements.
  4. E-Commerce: Enhances user experience with personalized recommendations.

    • Example: Suggesting similar products based on past searches and purchases.
  5. Health Care: Used for detecting tumors, drug discovery, medical image analysis, virtual medical bots, genetics, genomics, and predictive modeling for diagnosis.

  6. Image Recognition: Identifies people and objects in images so they can be tagged.

    • Example: Facebook's auto-tagging feature.
  7. Targeted Recommendations: Provides targeted advertisements.

    • Example: Displaying ads for a mobile phone after the user searches for it online.
  8. Airline Route Planning: Predicts flight delays and optimizes routes.

  9. Data Science in Gaming: Improves computer opponent performance using past data.

    • Example: Chess, EA Sports.
  10. Delivery Logistics: Finds the best routes, delivery times, and modes of transport.

    • Example: DHL, FedEx.
  11. Autocomplete: Provides auto-completion suggestions in various applications.

    • Example: Google Mail, search engines, social media.
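
As a rough illustration of the autocomplete application above, the sketch below suggests completions by prefix lookup over a sorted list of past queries, ranked by how often each was seen. The function names and the query counts are hypothetical. Real systems typically use richer models and personalization, so treat this only as one simple way the idea could work.

```python
from bisect import bisect_left

def build_index(query_counts):
    """Return the known queries sorted alphabetically for prefix lookup."""
    return sorted(query_counts)

def autocomplete(prefix, index, query_counts, limit=3):
    """Return up to `limit` queries starting with `prefix`, most frequent first."""
    start = bisect_left(index, prefix)  # first entry >= prefix
    matches = []
    for query in index[start:]:
        if not query.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(query)
    matches.sort(key=lambda q: (-query_counts[q], q))
    return matches[:limit]

# Hypothetical past queries with the number of times each was searched.
query_counts = {
    "data science": 120,
    "data scientist salary": 80,
    "data structures": 95,
    "database marketing": 15,
    "deep learning": 60,
}
index = build_index(query_counts)
print(autocomplete("data s", index, query_counts))
# → ['data science', 'data structures', 'data scientist salary']
```

Keeping the queries sorted means all entries sharing a prefix sit next to each other, so the lookup is a binary search followed by a short scan.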

Data Security Issues

  • Data Storage: Adopting cloud data storage introduces security risks.

    • Solution: Combining on-premise and cloud storage to balance security and flexibility, employing cybersecurity experts for on-premise databases.
  • Fake Data: Inaccurate information can lead to unnecessary actions and reduced production.

    • Solution: Validating data sources and evaluating machine-learning models to find anomalies.
  • Data Privacy: Protecting sensitive information from cyberattacks and data loss.

    • Solution: Implementing strict data privacy principles, access management services, and following rules for data handling and network security.
  • Data Management: Security breaches can compromise critical business information.

    • Solution: Deploying highly secured databases with access controls, practicing data encryption, segmenting and partitioning data, securing data in transit, and implementing a trusted server.
  • Data Access Control: Managing which data users can view or edit.

    • Solution: Shifting to cloud-based Identity and Access Management (IAM) and following relevant ISO standards.
  • Data Poisoning: Tampering with training data of machine learning models.

    • Solution: Using outlier detection to separate injected elements from the existing data distribution.
  • Employee Theft: Employees leaking sensitive information.

    • Solution: Implementing legal policies, securing the network with a virtual private network, and using Desktop as a Service (DaaS) to eliminate local data storage.
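
The outlier-detection idea mentioned under Data Poisoning can be sketched as follows. This is only one simple approach, chosen for illustration: it flags values whose modified z-score (based on the median and the median absolute deviation) exceeds a threshold. The function name, the sample values, and the commonly used 0.6745 scaling and 3.5 cutoff are assumptions, and carefully crafted poisoned samples may not stand out this clearly.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`."""
    center = median(values)
    # Median absolute deviation: a spread measure that extreme values barely move.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical training feature with two injected values (55.0 and -30.0).
values = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2, 9.7, -30.0, 10.0]
suspicious = flag_outliers(values)
print(suspicious)  # → [5, 8]

# Keep only the points that fit the existing distribution.
trusted = [v for i, v in enumerate(values) if i not in suspicious]
```

The median and MAD are used instead of the mean and standard deviation because the injected values themselves would inflate a mean-based estimate and make them harder to detect.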