What is an API
Set of rules and instructions (Requests/Responses) that different systems/apps can use to communicate with each other.
Makes it possible for different programs and apps to work together, share data, and accomplish things even if they were built by different people or companies.
Exposes some of a program's internal functions to the outside world, making it possible to share data and take actions (e.g., signing up with a Facebook/Gmail account).
Benefit of APIs
let developers use features from existing services w/o having to build them from scratch.
Like borrowing ready-made tools to make creating new applications faster and easier.
Approaches to creating APIs
Design first: the API is designed first using a tool like Swagger or API Builder; this helps make it highly consistent and reusable.
Code first: code is written first then a tool is used to generate the API specification from the code.
Middleware
Simplifies building distributed applications by providing ready-made tools and features that hide complex details.
Devs can focus on their tasks without dealing with every technical aspect of the application. Makes it easier to build software to call an API.
Types of middleware
Java RMI (Remote Method Invocation) - allows a program to call methods on Java objects running on other computers (see the client-side sketch after this list)
SOAP web services - allow procedures running on other computers to be called.
REST web services - allow remote resources to be created, queried, updated or deleted.
Message brokers - special servers that allow multiple clients to send messages to each other (asynchronously).
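Picking up the Java RMI entry above, a minimal client-side sketch. The StockService interface, host name and registry binding are hypothetical; a real server would implement the interface, export it (e.g. via UnicastRemoteObject) and bind it under that name.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// Hypothetical remote interface: callers on other machines invoke these
// methods as if they were local, while RMI handles the network communication.
interface StockService extends Remote {
    double getPrice(String symbol) throws RemoteException;
}

public class StockClient {
    public static void main(String[] args) throws Exception {
        // Look up a server object registered (elsewhere) under the name "StockService".
        Registry registry = LocateRegistry.getRegistry("server.example.com", 1099);
        StockService stocks = (StockService) registry.lookup("StockService");
        System.out.println("ACME: " + stocks.getPrice("ACME"));
    }
}
```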
Benefits of Middleware
Takes care of the lower level aspects, so devs can focus on building important aspects of their applications.
saves time and effort by handling tasks like data transfer, communication protocols, and managing connections between different parts of an application.
Microservices
a way of building an application by dividing it into smaller, self-contained parts. Each service can run independently and communicate with other services using a simple and lightweight method.
Each service can be updated or changed without affecting the other services.
Microservices Pros
Ravioli code: services are like individual compartments in a box of ravioli; it is easier to understand and work on different parts of the system without getting tangled up in complex connections.
Independent and flexible: each micro-service operates on its own. It can be deployed and updated separately, w/o affecting other services.
Technology diversity: devs can pick the best tools for each micro-service.
Microservices Cons
Harder to program: developers must handle communication between services.
Remote calls are slower than internal method calls and they have a risk of failing.
Consistency problems: Changes made to one service may not be immediately reflected in other services, leading to potential consistency problems.
Operational complexity: managing deployment, monitoring, and scaling multiple services adds another layer of complexity (and a need for experienced developers).
Motivation for REST web services
Other people might want to build their own services on top of yours, which a web service makes possible
Scalability: the service can be scaled easily if, say, a flood of users arrives in 10 years' time that you could not plan for
Safe and idempotent in HTTP
Safe: the request should have no side effects on the server or the data it operates on (read-only).
Idempotent: sending the same request multiple times has the same effect as sending it once (illustrated in the sketch after the DELETE notes below).
GET
Used for requesting and retrieving data across the web
A safe and idempotent operation
Sending it 10 times has the same effect as sending it once, and it changes nothing on the server
POST
Used for uploading / submitting a file or form to the web
Not safe or idempotent
It alters data, and submitting it multiple times can produce different outcomes (e.g., duplicate records).
PUT
Used for updating something at a certain URI
Idempotent but not safe
Repeating the same request produces the same result, but it can have side effects on the server or data (it replaces the resource at that URI).
DELETE
Used for removing something from the web
Idempotent but not safe.
It can have side effects on the server or data, potentially causing permanent deletion/modification of resources.
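A minimal sketch of the four methods using Java's built-in HttpClient (Java 11+). The https://api.example.com/products resource is hypothetical; the comments note which calls are safe and/or idempotent.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpMethodsDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "https://api.example.com/products";   // hypothetical API
        String json = "{\"name\": \"Widget\", \"price\": 9.95}";

        // GET: safe and idempotent - repeating it changes nothing on the server
        HttpRequest get = HttpRequest.newBuilder(URI.create(base + "/42")).GET().build();

        // POST: neither safe nor idempotent - each call may create another product
        HttpRequest post = HttpRequest.newBuilder(URI.create(base))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json)).build();

        // PUT: idempotent but not safe - repeating it leaves the same final state
        HttpRequest put = HttpRequest.newBuilder(URI.create(base + "/42"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json)).build();

        // DELETE: idempotent but not safe - the resource is gone after the first call
        HttpRequest delete = HttpRequest.newBuilder(URI.create(base + "/42")).DELETE().build();

        for (HttpRequest r : new HttpRequest[]{get, post, put, delete}) {
            HttpResponse<String> resp = client.send(r, HttpResponse.BodyHandlers.ofString());
            System.out.println(r.method() + " -> " + resp.statusCode());
        }
    }
}
```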
Addressability (REST principle)
Each resource in a web service has a unique URL, making it easy for clients to locate and interact with specific information and functionalities. It enhances organisation and accessibility.
e.g., For an online store, each product has its own address (URL). This lets clients easily access and perform actions on specific products
Statelessness (REST principle)
Every HTTP request to the server contains all the needed information, so the server doesn't remember any past interactions or session data from the client.
e.g., users must provide their login credentials with each action they take, and the server verifies these credentials for authorisation before processing their request.
Uniform Interface (REST principle)
Web services use a common language (methods like GET, POST, etc. and data formats like JSON or XML) to communicate and share information. This makes it easier for different systems to understand and work together.
Connectedness (REST principle)
Hyperlinks in the API responses make it easier to explore and access related resources, providing clear navigation and interaction guidance.
e.g., by including hyperlinks, the server guides clients on how to navigate the application's functionality and access related resources.
Statelessness benefits
The server never has to worry about clients timing out as interactions only last one single request
Server never loses track of where each client is because the client sends all necessary info with each request
Relational database
have a fixed structure called a schema and use SQL for managing data. They are easy to work with, have predictable organisation, minimal data redundancy, and support efficient querying using SQL.
Semi-structured (data organisation)
Standard structure or format but the schema is flexible and is sometimes self-describing.
Structured text like JSON, XML, YAML
Unstructured (Data Organisation)
Implies no specific schema or data model. Free form text or binary data (Word, PDF)
Data lake
a storage repository that holds a vast amount of raw data in its original format until the business needs it. No structure at all.
Can easily become a "data swamp" with no meta-data, irrelevant data, no automation and poor cleaning
Parallel processing
Computing multiple parts of a task at the same time to increase performance. Any part that cannot be done in parallel will become a bottleneck. The data for each parallel process should be stored locally.
Amdahl's Law
A rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns. Applies to any part of the system that cannot be done in parallel. Diminishing returns for throwing hardware at a problem
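A worked form of the law, assuming a fraction p of the work benefits from the improvement and that part is sped up by a factor s:

\[ S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}} \]

For example, if 90% of a job can be parallelised (p = 0.9) across 10 nodes (s = 10), the overall speedup is 1 / (0.1 + 0.09) ≈ 5.3; even with unlimited nodes it can never exceed 1 / (1 − p) = 10, hence the diminishing returns for throwing hardware at the problem.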
Batch processing
Occurs at regular intervals (monthly, every GB, every 5 million clicks)
Data is not always up to date
Processing is not as time-constrained
Probably a better option if data comes in chunks already
Real-time processing
Processed as it arrives
Tight time constraints
Immediate response needed
Susceptible to 'event storms'
Needed for data streams like news feeds etc.
Operational Databases
Day to day operations, transactional databases.
Relational
Mostly update commands used on it
Simple and predictable queries
High level of detail in the transactions
Functional and process oriented
Analytical databases
For management decision making
Relational and non-relational, denormalised
Data warehouses
Complex, ad hoc queries
Aggregation of facts, grouped by variables (dimensions)
Lower level of detail, looking at a broader scope
Data warehouse
Designed and optimised for analytical purposes
Schema is very different from an operational database
Organised around business metrics and facts
analysis conducted through dimensions
Aggregated at many levels of detail, caring about the trends in the data not so much individual transactions
Usually are massive (Petabytes)
Traditional DB vs Data Warehouse
Timespan - Op DBs focus on current transactions. DWs take a longer view
Granularity - Op DBs have a fine level of detail. DWs aggregate at various levels; may only include aggregate data
Dimensionality - Operational tables are "flat". DWs aggregate data by many dimensions
Op DBs focus more on the data itself (the attributes of a product, for example)
DWs focus more on results; answering queries is the whole point
Facts (Warehouses)
The values that we are interested in. The measure or the dependent variable. They are usually aggregate numbers like total revenue, average profit. Simple values like price, cost, GST, Profit. Attributes we use in calculations.
Dimensions (Warehouse)
Influence our view of the facts. The factor or the independent variable. Time is a key dimension. They are usually internally hierarchical (day, month, year). Dimensions are usually used to filter the facts
Time dimension
Not as simple as it seems: many possible granularities (unit sizes), e.g. year, month, week, day; alternative units such as season and quarter; inconsistencies such as fiscal years differing across the world, months of different lengths, and leap years; and time zones.
Star Schema
Central fact table, cluster of related dimension tables (relational DB). Each row represents a combination of dimension values. A partially denormalised schema is the essence of a star schema.
Slowly changing dimensions
The technique used to manage attribute changes in a dimension over time. Options:
Retain the old history and leave the data as it is
overwrite with new data
row versioning
extra columns
Dimensional Hierarchy
Dimension tables are usually denormalised, different levels of aggregation are needed.
Base fact table always aggregated at the highest level of detail across all dimensions
Pre compute commonly used levels of aggregation
Problem: too many fact tables
How many fact tables are needed to cover all possible combinations of dimensional aggregation levels?
Options:
Pick the ones that are used more commonly and precompute those
Create additional fact tables as materialised views derived from the base fact table with query rewrite enabled
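A hedged sketch of the second option, using Oracle-style materialised view syntax over a hypothetical sales_fact star schema via JDBC; the connection details and table names are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MonthlySalesView {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; URL, user and password are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//warehouse:1521/DW", "dw_user", "secret");
             Statement stmt = con.createStatement()) {

            // Pre-compute a commonly used aggregation level as a materialised view.
            // With query rewrite enabled, the optimiser can answer matching queries
            // on the base fact table from this view instead.
            stmt.execute(
                "CREATE MATERIALIZED VIEW sales_by_month " +
                "ENABLE QUERY REWRITE AS " +
                "SELECT t.year, t.month, p.category, SUM(f.revenue) AS total_revenue " +
                "FROM sales_fact f " +
                "JOIN time_dim t ON f.time_id = t.time_id " +
                "JOIN product_dim p ON f.product_id = p.product_id " +
                "GROUP BY t.year, t.month, p.category");
        }
    }
}
```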
OLAP
Online Analytical Processing. For huge volumes of data; pre-processed and summarised; fast user reports; no access to individual details.
OLAP tools
Data transformation. Business modelling. Statistical analysis. Powerful GUI query facility. Visualisation (graphics)
Drilling down
a dimension hierarchy provides a more detailed view of the facts (aggregation by smaller units).
Rolling up (data warehouses)
The opposite of drilling down: it collapses the data from multiple items into a single value. Rolling up a dimension hierarchy provides a more summarised view of the facts (aggregation by larger units).
OLAP storage
Internal proprietary DB (MDD); relational (ROLAP); multidimensional (MOLAP); both (HOLAP).
Analytic query (warehouses)
Ranking queries; running/cumulative totals; computing multiple aggregates with different groupings in one operation; windowed queries; first/last/nth value; converting rows to a list.
Some are possible in plain SQL
Analytic function
Enhanced version of aggregation. Simple aggregate functions are affected by GROUP BY and summarise multiple rows into a single value.
Analytic functions are not affected by GROUP BY; they summarise across multiple rows but do not reduce the number of rows in the result
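A minimal sketch, assuming a JDBC driver on the classpath and a hypothetical sales table (sale_date, region, amount): the window function adds a running total to every row without collapsing rows the way GROUP BY would.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RunningTotalDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and 'sales' table.
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/dw", "dw", "dw");
             Statement stmt = con.createStatement();
             // SUM(...) OVER (...) is an analytic (window) function: it adds a running
             // total to every row; every input row still appears in the result.
             ResultSet rs = stmt.executeQuery(
                 "SELECT sale_date, region, amount, " +
                 "       SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total " +
                 "FROM sales ORDER BY region, sale_date")) {
            while (rs.next()) {
                System.out.printf("%s %s %.2f running=%.2f%n",
                        rs.getDate("sale_date"), rs.getString("region"),
                        rs.getDouble("amount"), rs.getDouble("running_total"));
            }
        }
    }
}
```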
GraphQL
An alternative to REST
A query language for APIs and a runtime for fulfilling those queries with your existing data. Provides a description of the data in your API, gives your clients the power to ask for what they need and nothing more, makes it easier to evolve APIs over time and enables powerful developer tools.
GraphQL needs a little more to run (a middleware/runtime layer), whereas a typical REST system runs directly alongside HTTP.
GraphQL vs REST
REST offered great ideas initially (Statelessness, structured access to servers) but is too inflexible to keep up with changing requirements.
A clearly defined schema makes it easier for front and backend teams to work since they both know the definite structure of the data that is sent across the network.
No more over and underfetching of the data
GraphQL is more efficient than REST or RPC APIs
Underfetching (REST vs. GraphQL)
a specific endpoint doesn't provide enough of the needed info. The client will need to make more requests to fetch everything it needs. E.g. having to send 20+ requests to get information about a football team
GQL queries can include nested structures (e.g. a team roster) and all matching data will be returned. This is all done with a single URL and a single request, so there is less time spent waiting and less processing required.
Overfetching (REST vs GQL)
A REST GET call returns the same representation whether the client is a desktop or mobile app; a mobile client might have to filter out data it doesn't need, which is a waste of bandwidth and time.
GQL queries state what data fields they want.
Selecting and filtering fields for data retrieval
Specifies precisely what the client wants
No need for filtering within the API which can get far too complex
You can give the query a name so that if it fails you can see which query went wrong; this is also good for documentation.
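A minimal sketch of a client sending a named GraphQL query over HTTP with Java's built-in HttpClient. The endpoint URL and field names (team, roster, etc.) are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GraphQLQueryDemo {
    public static void main(String[] args) throws Exception {
        // A named query asking only for the fields the client needs, including a
        // nested structure (the roster), so one request replaces many REST calls.
        String query = """
            query TeamOverview {
              team(id: "42") {
                name
                coach { name }
                roster { playerName position }
              }
            }""";

        // GraphQL requests are normally posted as JSON: {"query": "..."}
        String payload = "{\"query\": " + toJsonString(query) + "}";

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://api.example.com/graphql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // only the requested fields come back
    }

    // Minimal JSON string escaping for the embedded query text.
    private static String toJsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```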
Server side: Resolvers / data fetchers
queries are parsed
validated (against the schema)
executed
Every field on every type is backed by a function called a resolver. A resolver is a function that resolves a value for a type or field in a schema.
Resolvers can be asynchronous ... They can resolve values from another REST API, database, cache, constant, etc
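A minimal sketch using the graphql-java library (assumed to be on the classpath): the schema is parsed, a resolver (data fetcher) is wired to the Query.team field, and a query is then parsed, validated and executed against it. The schema and stand-in data are hypothetical.

```java
import graphql.GraphQL;
import graphql.schema.DataFetcher;
import graphql.schema.GraphQLSchema;
import graphql.schema.idl.RuntimeWiring;
import graphql.schema.idl.SchemaGenerator;
import graphql.schema.idl.SchemaParser;
import graphql.schema.idl.TypeDefinitionRegistry;

import java.util.Map;

public class ResolverDemo {
    public static void main(String[] args) {
        // A tiny schema: one query field backed by one resolver.
        String sdl = """
            type Query { team(id: ID!): Team }
            type Team  { id: ID! name: String }""";

        // The resolver (data fetcher) for Query.team: it could equally fetch the
        // value from a REST API, a database, a cache or a constant.
        DataFetcher<Map<String, Object>> teamFetcher = env -> {
            String id = env.getArgument("id");
            return Map.of("id", id, "name", "Example FC");   // stand-in data
        };

        TypeDefinitionRegistry registry = new SchemaParser().parse(sdl);
        RuntimeWiring wiring = RuntimeWiring.newRuntimeWiring()
                .type("Query", builder -> builder.dataFetcher("team", teamFetcher))
                .build();
        GraphQLSchema schema = new SchemaGenerator().makeExecutableSchema(registry, wiring);
        GraphQL graphQL = GraphQL.newGraphQL(schema).build();

        // Parse, validate and execute a query against the schema.
        System.out.println(graphQL.execute("{ team(id: \"42\") { name } }").getData());
    }
}
```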
Enterprise Application Integration (EAI)
EAI is a problem faced by businesses; developers have addressed it by creating message brokers and related middleware.
Covers the plans, methods, and tools aimed at integrating separate enterprise systems, of which there may be hundreds if not thousands of custom-built or off-the-shelf systems.
(EAI Problem #1) if a process takes a while, how do we not waste time?
Communicate via asynchronous messages
When there are multiple asynchronous requests sent, how do we make sure the response is matched with the right request?
Need some sort of reference number: something included in each message (a correlation identifier) that ties a reply back to its request.
The solution to this problem is to send messages to queues within a messaging server, which forwards them on to their destination.
(EAI Problem #2) business processes change, how do we futureproof?
Send messages to named channels via a commonly known messaging server (or, in larger organisations, a federated collection of message servers).
This means that as old apps are replaced with new ones, messages still get sent to the same place.
Message Oriented Middleware (MOM)
Supports asynchronous communication between applications using structured messages, and allows documents to be passed between them for processing.
Motivation for MOM
RPC or RMI
sender and receiver need to be available at the same time
sender must know the methods provided by the recipient
tight coupling
MOM
sent to a queue, recipient can retrieve at any time
message format must be understood by both but there is loose coupling and usually represents a type of business data
(EAI Problem #3) Independent programs run on their own schedules; the programs they need to talk to might not be running at the same time.
Ensure your messaging server supports "store and forward" messaging. Messages are stored in a database until they are successfully delivered
(EAI Problem #4) Apps may use different data representations. How do we keep loose coupling but still allow for translation of messages?
Ensure your messaging server is a message broker that allows message transformation rules to be defined for particular channels within the server
Use a message routing and mediation engine (e.g. Apache Camel) that can read and write to message brokers, web services, etc., and contain application logic to transform and route messages.
(EAI Problem #5) How do we avoid configuring the messaging server with a channel for each message type?
Ensure your messaging server is a message broker that allows content-based routing rules to be attached to channels within the server.
Use a message routing and mediation engine (e.g. Apache Camel) that can read and write to message brokers, web services, etc., and contain application logic to transform and route messages
The Java Message Service (JMS) API
JMS is a Java application programming interface that lets clients of a MOM send and receive messages
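A minimal sender sketch using the JMS API. It assumes an ActiveMQ broker on localhost (ActiveMQConnectionFactory is provider-specific; any JMS ConnectionFactory could be substituted) and a hypothetical "orders" queue; the correlation ID shows how an asynchronous reply can later be matched to this request (EAI Problem #1).

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class OrderSender {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("orders");          // named channel
            MessageProducer producer = session.createProducer(queue);

            TextMessage message = session.createTextMessage("{\"orderId\": 1234, \"total\": 99.50}");
            // Correlation ID lets an asynchronous reply be matched to this request.
            message.setJMSCorrelationID("order-1234");
            producer.send(message);                               // store-and-forward delivery
        } finally {
            connection.close();
        }
    }
}
```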
Enterprise Integrations Patterns (EIP)
65 design patterns for solving basic problems that commonly arise in messaging based EAI
Each pattern has a graphical icon so that diagrams can be used to show integration logic at a high level, making it easier to visualise what each pattern does.
The patterns can be combined together to solve complex integration problems
Service integration middleware
Allows for a connection of disparate apps and services together.
Provides tools to enable the services to communicate with each other.
Allows us to coordinate the communication between services
Apache Camel
Open source integration framework. Allows for the construction of a business process or service that integrates many types of application, service and data source.
Support for many message formats (XML, JSON, etc.)
Endpoints
A channel for messages to enter or leave Camel routes. An endpoint can be a consumer, a producer, or both. Endpoints are defined by endpoint URIs, e.g.:
http://
jdbc:
jms:queue:blahblah
Endpoints are implemented by Camel components
Processors (Message oriented middleware)
some Java DSL methods are predefined processors that perform operations on messages
The Exchange class
represents a message exchange along a route
Message processors can read and change the contents
MEP
Message Exchange Pattern. Most commonly the values are InOnly and InOut
Message bodies
Camel can store any data structure within a message body (String, XML, JSON, Java objects, etc.)
Meaning that Camel is unopinionated
The processing pipeline
The consumer (from()) endpoint creates the exchange object to represent a received message. It sets the MEP to InOut if it expects a response otherwise it is InOnly. The out message from one processor becomes the in for the next processor
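A minimal Camel route sketch illustrating the pipeline above. The endpoint URIs are illustrative; the jms: endpoint assumes the camel-jms component and a connection factory have been configured on the context.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class OrderRoute {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Consumer endpoint: each message taken from the queue becomes
                // an Exchange (MEP InOnly here, since no reply is expected).
                from("jms:queue:orders")
                    // A processor can read and change the message as it moves along the pipeline.
                    .process(exchange -> {
                        String body = exchange.getIn().getBody(String.class);
                        exchange.getIn().setBody(body.toUpperCase());
                    })
                    // Producer endpoint: the (possibly transformed) message leaves the route.
                    .to("file:outbox");
            }
        });
        context.start();
        Thread.sleep(10_000);   // let the route run briefly before shutting down
        context.stop();
    }
}
```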
Message broker with Camel
Supports multiple clients
Messages are persistent
We can look at the contents of queues and messages using ActiveMQ
We can test individual routes by manually adding messages to queues using ActiveMQ
Cloud and Virtualisation
Cloud is essentially virtualisation: your services will be running in some form of virtualised environment.
Deploying a service to a cloud platform will usually result in
A machine / container image being retrieved or created
A hypervisor booting that image to create a virtual guest
The guest may install some additional software it needs to run the service
The guest will start your service
Hypervisor
Software that provides virtual hardware to virtual machines.
Translates virtual instructions into real instructions on the physical hardware
Allows the booting of multiple guest OS that can run at the same time
each guest is isolated from the others
Provides features for managing VMs such as making snapshots of virtual disks and memory for backup and migration
Virtualisation benefits
Make more efficient use of existing IT infrastructure. In the past, a server would be dedicated to running one app. Able to use the infrastructure to its full potential.
Cut down OPEX and IT expenditure.
More eco-friendly.
Reduces downtime
Provisioning
Installing and configuring environments using simple, repeatable and consistent configuration
Server configuration
Installing and configuring the software on the servers needed for hosting the service. The same configuration should be used for different hosts.
This config should be repeatable (Puppet, Terraform)
Monitoring / management
tools that monitor the state of servers and services and warn admins when something is wrong. allow admins to configure remote servers usually via a web app.
IaaS (Infrastructure as a Service) platforms
Platforms that provide a complete stack for creating, running and managing IaaS servers. They allow admins to provision, run, and monitor virtual servers via a web app. Reduced risk compared to relying on one provider's higher-level offerings: if you no longer like a provider, you can move your machine images to another.
PaaS platforms
Platforms that provide a full stack of servers and libs for developing services. Web servers, DBMS servers. Can be deployed to virtual servers on any of the big providers
DevOps
An approach based on lean and agile principles in which business owners and the development, operations (hence DevOps) and quality assurance departments collaborate. Aimed at shortening systems development life cycles.
Infrastructure as Code
The idea of using declarative config files to provision your servers, allowing you to automate provisioning. Declarative means you describe what you want rather than providing the commands that perform the configuration; the tools work it out for you. The aim of DevOps is to have the entire infrastructure provisioned automatically from source code. Services can be deployed repeatably and reliably no matter where you start from. All information about how the various systems are configured is in the IaC scripts, and versioning can be used.
Container orchestration
The automated deployment, scaling and management of containerised applications
Addresses issues such as:
How do I distribute replicated applications across multiple servers
What order do apps need to be started
How to make sure apps keep running, and how they recover from crashes automatically
Kubernetes (k8s)
an open source system for automating deployment, scaling, and management of containerised applications.
K8s clusters
Cluster: a set of computing nodes (physical or VMs) that can run containers under the management of K8s.
K8s namespaces
Separate working areas within a k8s cluster. Not needed if you are the only one using a cluster. If there is more than one, you need to specify your namespace in each operation (e.g. kubectl -n <namespace> <operation>).
K8s Pods
K8s manages pods, not individual containers. A pod can contain one or more containers. Provides a way to set environment variables, mount storage, and feed info into a container.
Containers
Similar to virtualisation but there is no hypervisor and no guest OS. The guest only has the libs and programs needed for the service to run and piggybacks off the host's kernel. Much smaller images and higher performance than VMs.
LXC containers
Basic Linux container. Provides the engine that allows containers to run in a way that they are isolated between the host and other containers. Can be standalone
LXD
Runs on top of LXC and extends it. Its focus is on containerising entire Linux servers. Provides features similar to a hypervisor, but with better performance.
Docker
Runs on top of LXC and extends its capabilities. Its focus is on containerising single applications/services. There is a hub (Docker Hub), similar to GitHub, for hosting Docker images. Most large server software projects have an official Docker image. Uses the concept of layers.
CRIU
Checkpoint/Restore In Userspace: allows snapshots of running containers
Shared everything architecture
Disk and memory are shared between nodes; access to the shared resources becomes a bottleneck at scale (huge DBs). Nodes are not completely independent (parallel architecture).
Scaling follows Amdahl's law
Shared Nothing architecture
No shared resources and no dependency between the nodes. Scaling is linear and unlimited. data consistency can become a problem
Multi-node queries may require copying data between nodes
Scale up vs Scale out
Scale up: add more resources like CPU and memory to individual nodes. Compatible with SE and SN.
Scale out: add more nodes to the system (more web application servers, DB servers, etc.). Compatible with SN but problematic with SE.
NoSQL
"not only sql" a broad class of non-relational DBMSs, inspired by growing infrastructure needs on the web. Made use of Flexible schemas
Flexible schemas
No schema at all; schema defined by the data (self-describing, e.g. JSON and XML); entity-attribute-value (EAV) structure. All have implications for integrity, consistency and querying.
SQL problems
impedance mismatch between SQL and typical programming environments.
SQL is not well designed but is the standard. Hard to write and debug.
ACID is difficult to enforce on distributed node systems
NewSQL
A database model that attempts to provide ACID-compliant transactions across a highly distributed infrastructure.
Each NoSQL type focuses on solving a specific kind of problem: key-value data stores, column-oriented data stores (e.g. BigTable), document-oriented data stores.
Column-oriented data stores
Extensible columns of closely related data. Both rows and columns can be split over multiple nodes.
Google BigTable
Shared nothing
Supports single row transactions
Simple queries
No secondary indexes
Only one server is responsible for a given piece of data
Document oriented data stores
Organises data as a collection of documents
(e.g. MongoDB)
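A minimal sketch using the MongoDB Java driver (assumed to be on the classpath) against a local server; the "shop" database and "products" collection are hypothetical. It illustrates the document model and its flexible schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.List;

public class ProductStore {
    public static void main(String[] args) {
        // Assumes a MongoDB server on localhost.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products");

            // Documents in the same collection need not share a fixed schema:
            // the second document simply has extra fields.
            products.insertOne(new Document("name", "Widget").append("price", 9.95));
            products.insertOne(new Document("name", "Gadget").append("price", 19.95)
                    .append("tags", List.of("new", "featured")));

            // Query by field value, much like a WHERE clause.
            Document found = products.find(Filters.eq("name", "Gadget")).first();
            System.out.println(found.toJson());
        }
    }
}
```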