AT

465 Data Applications --- Terms and Concepts to Review

GENERAL:

R programming language- language and environment for statistical computing and graphics, data analysis and

the development of statistical software.

Hypertext Markup Language v5- defines content and structure of web browser content.

Java- Android language designed to have few implementation dependencies as possible.

SWIFT – Apple program language.

C++- general purpose programming language (developing operating systems).

CSS- Cascading Style Sheets language used for specifying presentation and style of a document.

Pipe tool in R- (%>%) chains multiple operations together- uses the output of the expression on left and

becomes the first argument of the function to the right.

tidyverse package Tidyr- provides a set of functions that help you get to tidy data (consistent form data).

tidyverse package readxl- importing Excel into R program / R studio.

YAML- human readable data serialization language. Configures files and where data is stored in applications.

Metadata- data that provides information about other data. Descriptive metadata provides information about

a resource.

Calculated field- new field that performs calculations based on existing fields in your Pivot Table (Excel).

Algorithm- process or set of rules to be followed in calculations especially by a computer.

Pivot tables- (excel) powerful tool to calculate, summarize, and analyze data to visualize patterns & compare.

Code chunk- (R markdown) runnable piece of R code. History of inputs that can be shared with other analyst.

Jupyter Notebook- (python) web based interactive computing platform (live code, equations, narrative text).

Rich Text Format- file format that enables exchanging text files between different word processing programs.

Dashboard- type of graphical user interface which provides info of key performance indicators.

GitHub document- management of documentation for software projects (created and hosted by github)

dot RMD format- (R markdown) file format combines markdown text with embedded R code chunks. (

Reproducible research and data analysis, as it allows you to write narrative text, including code and generate

dynamic reports or documents.

the knit option- (code chunks) ability to convert a file from dot RMD format to another file type to be shared.

Geoms layer- combines data, aesthetic mapping, an object and a position adjustment. (creates visualizations)

Annotate layer- layer used to display text about features on a map. (comments section for visualization)

tidyverse package ggplot2- (visualizations) transforms and creates the different plots (box plot, scatter plots,

counts chart, bar graphs, histograms…),

Functions in the ggplot2 package such as:

ggplot()- provide the data to be inserted in a graph or chart.

geom_boxplot()- boxplot compactly displays the distribution of a continuous variable. It visualizes five

summary statistics (the median, two hinges and two whiskers plot.

geom_line()- creates line graphs. It enables us to specify which variables in the data should connect to which

parts of the plot.

geom_bar()- makes the height of the bar proportional to the number of cases in each group.

geom_point()- creates layer of points for the scatter plot.

Microsoft Excel tools/functions:

Data Validation tool- restrict the type of data or values that users enter in cells.

LEN- length of the string. returns the number of characters in a given cell.

CONVERT -number from one measurement system to another.

MIN, MAX- minimum and maximum functions within Excel.

VALUE- convert a text string that represents a number into a number.

VLOOKUP- when you need to find things in a table or a range by row.

HLOOKUP- used to search for a value in the top row of an array of values, and then it retrieves the value in the

same column from a row you specify in the table or array.

TRIM- eliminate extra spaces within a text string

SUMPRODUCT- allows you to multiply ranges of cells and arrays and returns the product's sum.

COUNT- to get the number of entries in a number field that is in a range or array of numbers.

COUNTIF- count the number of cells that meet a criterion.

COUNTIFS- counts cells in a range based on one or more true or false condition.

MATCH - searches for a specified item in a range of cells, then returns the position of that item in the range.

SUM- performs the addition operation.

SUMIF – adds the values if certain criterion is met.

SUMIFS - adds the values if certain multiple criterions are met.

Structured Query Language (SQL) functions:

DATETIME format: YYYY-MM-DD HH:MI:SS

CONCAT_WS function adds two or more strings together with a separator.

JOIN combine rows from two or more tables, based on a related column between them.

INNER JOIN selects records that have matching values in both tables.

OUTER JOIN used to return all the records that have values in either the left or right table.

AS used to rename a column or table with an alias.

FROM specify which table to select or delete data from

WITH allows you to give a sub-query block a name (a process also called sub-query refactoring)

ORDER BY keyword sorts the records in ascending order by default. (type DESC for descending)

GROUP BY statement is often used with clustered functions ( COUNT() , MAX() , MIN() , SUM() , AVG() )

to group the result-set by one or more columns.

Basic functions in R:

mean() method calculates the mean (average) of the given data set.

median() calculates in R to find the middle most value in a data series.

sd() calculates the standard deviation within R.

var() calculates the variances within a number set.

max() returns the maximum of a specified table or list.

min() returns the minimum of a specified table or list.

quantile() divides the data into equal halves, in which the median acts as middle and over that the remaining

lower part is lower quartile and upper part is upper quartile.

summary() useful to quickly summarize the values in a vector, data frame model in R.

typeof() provide useful information about your vectors and R objects in general.

sum() returns the addition of the values passed as arguments to the function.

print() generic function which means that new printing methods can be easily added.

range() returns a vector containing the minimum and maximum of all the given arguments.

str() displays the internal structure of an object such as an array, list, matrix, factor, or data frame.

ncol() returns the total number of columns present in the object.

length() used to find out how many items are present in a vector.

annotate() allows to add all kind of shape on a ggplot2 chart.

names() give the names of the corresponding list (in environment pane).

unite() convenience function to paste together multiple variable values into one.

split() takes a vector or other objects and splits it into groups determined by a factor or list of factors.

Pipe operator (%>%) way to chain multiple operations together in a concise and expressive way.

bias function the average amount by which actual is greater than predicted (closer to 0.0 less bias).

** in dot RMD format adds a Bold Font.

Add [r] to begin code chunk to load two packages in R markdown.

Jupyter Notebook tool used with files that executes everything as code sequentially with R and Python.

YAML helps edit structures reference data that sorts and identifies aspects of the information.

Code chunks show the actual code embedded in a file, its output, evidence for the findings, and the sources.

Make the code dynamic to increase readability. Concise language (combine similar lines of code).

Use annotate() to draw shapes and arrows to a plot built using ggplot2.

To use JOIN function, know which table you are using all & which table you are borrowing from (left or right).

WITH is a clause used to create a specific rows and columns that executes the same process as a temp table.