Glossary – Data Science

A/B Testing

The practice of comparing two variants of a single variable (e.g. two versions of a webpage) to determine which one performs better.

 

Activation Function

A function that maps the input signal of a node in an artificial neural network to an output signal. This output is then used as an input to the nodes in the next layer.

 

Artificial Intelligence

An area of computer science that focuses on the creation of intelligent machines that operate and react much like humans do. Such machines process inputs from the environment, using experience they have gained, to make intelligent decisions.

 

Autoencoders

A neural network trained with an unsupervised learning algorithm in which the output target values are set equal to the inputs. The autoencoder’s goal is to reduce the number of random variables under consideration, such that the input is represented in fewer dimensions (see Dimension Reduction).

 

Backpropagation

Refers to the “backward propagation of errors”. A technique used in artificial neural networks to calculate a gradient for updating the weights and biases in the network. The error is computed at the output and distributed backwards throughout the network’s layers.

 

Bayes’ Theorem

A mathematical formula for determining conditional probabilities: P(A|B) = P(B|A) * P(A) / P(B). The formula gives the probability of an event when we already have information about some other condition related to the event.
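
As a minimal worked sketch of P(A|B) = P(B|A) * P(A) / P(B): the numbers below are made-up illustration values (a condition with a 1% base rate and a test with a 95% detection rate and a 5% false-positive rate), and the function name is hypothetical.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / prob_b


# Hypothetical illustration values, not taken from any real study.
p_disease = 0.01             # P(A): base rate of the condition
p_pos_given_disease = 0.95   # P(B|A): test is positive when the condition is present
p_pos_given_healthy = 0.05   # test is positive when the condition is absent

# P(B) via the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = bayes_posterior(p_disease, p_pos_given_disease, p_pos)
print(round(p_disease_given_pos, 3))  # 0.161
```

Under these assumed numbers, a positive test still leaves only about a 16% chance of the condition being present, because the condition is rare to begin with.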

 

Bayesian Network

A probabilistic graphical model for representing multivariate probability distributions. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given the symptoms, the network can be used to work out the probabilities of the presence of various diseases. Also known as “belief networks” or “causal networks”.

 

Bias

In machine learning, bias occurs when results are systematically prejudiced because of errors in the machine learning process that cause the learner to consistently learn the same wrong thing.

 

Big Data

Collections of data, both structured and unstructured, that were once deemed impractical to handle because their volume, velocity and variety meant they could not be processed by traditional data-processing applications. Big Data can now be processed and analyzed to generate insights and support better decisions.

 

Chi-Square Test

A statistical hypothesis test to determine if two categorical variables are related in a population.
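
A minimal sketch of how the test statistic can be computed for a contingency table of observed counts, assuming the usual Pearson form (sum of (observed - expected)^2 / expected). The function name and the counts are hypothetical illustration values.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two groups, columns are two outcomes.
observed = [[30, 10],
            [20, 40]]
stat = chi_square_statistic(observed)
print(round(stat, 2))  # 16.67
```

For a 2x2 table there is one degree of freedom, and the critical value at the 5% significance level is about 3.84, so a statistic this large would suggest that the two variables are related.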

 

Classification

The process of determining the class of given data points. Classes are sometimes called targets, labels or categories.

 

Clustering

The process of organizing data into groups such that all members of each group have some common characteristics.

 

Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification algorithm on a set of test data for which the true values are known. The matrix uses the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) to evaluate the performance.
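
The four counts can be tallied directly from true and predicted labels. The sketch below assumes a binary problem with labels 0 and 1; the function name and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical labels for eight test examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts)    # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
print(accuracy)  # 0.75
```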

 

Control Set

When using cross-validation to test a learning model, the model is trained on the training set, and then its performance is tested on the control set (also known as the ‘test set’).

 

Continuous Variable

A variable that can take on any value between its defined minimum and maximum values.

 

Cross-Validation

The process of dividing available data into a test set and a training set. The model is trained on the training set and then applied to the test set (or ‘control set’) to measure its performance; the split is typically repeated several times with different partitions and the results averaged. Cross-validation therefore helps us to evaluate how well a model makes predictions on unseen data.
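
One common variant is k-fold cross-validation, in which the data is cut into k parts and each part serves once as the test set. The sketch below only generates the index splits; it assumes the samples are already shuffled, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra sample each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_samples=6, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

A model would be trained and scored once per fold, and the k scores averaged to give the overall estimate.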

 

Data Mining

The process of finding patterns, relationships, correlations and anomalies within large data sets to predict outcomes.

 

Data Science

The interdisciplinary field of using tools and techniques to extract knowledge and insights from both structured and unstructured data.

 

Data Structure

A particular organization of units of data that enables efficient access and modification, such as an array or a tree.

 

Data Wrangling

The process of preparing complex data for easy access and analysis. It can involve cleaning the data, unifying the data, converting the format of the data and combining the data for easier organization.

 

Decision Boundary

A boundary that separates the elements of one class from those of another class.

 

Decision Tree

A flowchart-like tree structure modelling a supervised learning algorithm. A decision tree represents a number of possible decision paths and an outcome for each path, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a decision.

 

Deep Learning

A subset of machine learning that uses multi-layered neural networks to learn patterns from large amounts of data. Deep learning is a key technology behind driverless cars.

 

Dependent Variable

A variable under test that changes when changes are made to an independent variable.

 

Dimension Reduction

The process of reducing the dimensional representation of the data without losing much important information. When datasets have more than 3 dimensions, dimension reduction techniques prove particularly useful. Principal Component Analysis is one such technique.

 

Discrete Variable

A variable that can only take one of a specific number of values.

 

Exploratory Data Analysis

An analytical approach used for summarizing and visualizing data, and for discovering insights from data.

 

Gradient Boosting

A machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

 

Gradient Descent

An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. To maximize a function instead, the algorithm would move in the direction of the positive gradient, which causes the function to increase the most (gradient ascent). The step is repeated from each new point.
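
A minimal sketch of the iteration for a function of one variable, assuming the gradient is available as a Python function. The function name, the example function f(x) = (x - 3)^2 and the default settings are illustrative choices.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * gradient(x)  # move in the direction of steepest descent
    return x


# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))  # 3.0
```

Flipping the sign of the update (adding the gradient instead of subtracting it) would turn the same loop into gradient ascent.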

 

Graphics Processing Unit (GPU)

A computer chip that performs complex mathematical calculations rapidly, mainly for the purposes of image rendering. Often used to train and run deep learning models, which require large amounts of processing power.

 

Hyperparameter

A parameter of a model that cannot be learnt from the data and is therefore determined by experimenting with different values to work out which is most suitable.

 

Independent Variable

A variable that, when changed, causes a change in a dependent variable.

 

K-Means Clustering

A type of unsupervised learning algorithm for use with unlabeled data (without defined categories or groups). The aim is for the algorithm to partition the data set into a certain number of clusters (represented by the variable K), assigning each point to the cluster with the nearest mean.
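
A minimal sketch of the usual iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points), restricted to one-dimensional data for brevity. The function name, the data and the starting centroids are hypothetical; practical implementations pick the starting centroids more carefully.

```python
def k_means_1d(points, centroids, n_iterations=10):
    """Cluster 1-D points around len(centroids) centroids (K = len(centroids))."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])  # K = 2
print(centroids)  # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```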

 

K-Nearest Neighbors

A supervised learning algorithm that predicts the classification of a new data point from a database of labeled data points: the new point is assigned the class most common among the K points closest to it.
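
A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote among the K closest labeled points. The function name and the data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(training_data, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort the labeled points by Euclidean distance to the query point.
    by_distance = sorted(training_data, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled points: (features, label).
training_data = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(training_data, (1.2, 1.4)))  # A
print(knn_predict(training_data, (6.4, 6.6)))  # B
```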

 

Labeling

The process of adding meaningful tags to data. Models can use labeled datasets to predict labels for unseen, unlabeled data.

 

Latent Variable

A variable that cannot be observed directly, but is instead inferred from observed data using models. For example, if answers in an IQ test are observed data, then an estimate of the IQ is the latent variable.

 

Learning Rate

A scalar that is applied to the gradient in gradient descent algorithms to determine the next point in relation to the previous point. The learning rate is a hyperparameter: if it is set too low, learning will take too long; if it is set too high, the model may never converge. As such, selecting a suitable learning rate is important.
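
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(x) = x^2 (gradient 2x, minimum at x = 0) with three illustrative learning rates; the function name and the specific values are assumptions chosen for the demonstration.

```python
def descend(learning_rate, start=1.0, n_steps=50):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * 2 * x  # gradient of x^2 is 2x
    return x


too_low = descend(0.01)   # still far from 0 after 50 steps
suitable = descend(0.4)   # very close to 0
too_high = descend(1.1)   # overshoots further on every step and diverges

print(abs(too_low) > 0.1, abs(suitable) < 1e-6, abs(too_high) > 1.0)  # True True True
```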

 

Linear Regression

A regression technique that is used to determine the linear relationship between a dependent variable and one or more independent variables.
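
For a single independent variable, the best-fitting line can be found with ordinary least squares. The sketch below is a minimal version of that calculation; the function name and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```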

 

Logistic Regression

A statistical regression model that is applied when the dependent variable is binary. As such, it explains the relationship between one dependent binary variable and one or more independent variables.

 

Machine Learning

A branch of artificial intelligence. Machine learning involves analyzing data to learn and generate analytical models which can perform intelligent actions on unseen data, with minimal human intervention.

 

Markov Chain

A stochastic model used to model random processes. It describes a sequence of possible events where the probability of each event depends solely on the outcome of the previous event.
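
A minimal sketch with two states, assuming the chain is described by a transition matrix whose entry [i][j] is the probability of moving from state i to state j. The weather states and the probabilities are made-up illustration values.

```python
def step(distribution, transition):
    """Advance a probability distribution over states by one step of the chain."""
    n = len(transition)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]


# Hypothetical two-state chain: state 0 = sunny, state 1 = rainy.
transition = [
    [0.9, 0.1],  # from sunny: 90% sunny, 10% rainy
    [0.5, 0.5],  # from rainy: 50% sunny, 50% rainy
]

today = [1.0, 0.0]                       # it is sunny today
tomorrow = step(today, transition)       # depends only on today
day_after = step(tomorrow, transition)   # depends only on tomorrow
print([round(p, 2) for p in tomorrow])   # [0.9, 0.1]
print([round(p, 2) for p in day_after])  # [0.86, 0.14]
```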

 

Model Fitting

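The process of adjusting a model’s parameters so that its outputs match a data set as closely as possible, and of verifying how well the resulting model performs against that data.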

 

Model Selection

The process of selecting the appropriate statistical model from a wider set of possible models, given data.

 

Model Tuning

The process of adjusting parameters of a given model in order to improve its overall performance.

 

Natural Language Processing (NLP)

A sub-field of artificial intelligence that studies how computers can interact with humans based on their understanding of natural language as spoken by humans. NLP aims to improve computers’ understanding of language to a level as close as possible to that of humans.

 

Neural Network

Modelled on the human brain, a neural network is a set of algorithms designed to recognize patterns. Neural networks take arbitrary inputs, process them and use the activation function to generate outputs. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.

 

Outlier

Extreme values in data representing unusual observations, errors in measurement and recording, or accurate reporting of rare events.

 

Overfitting

When a model has been excessively trained on specific data, such that it becomes inapplicable to other datasets. The model is so specific to the original data that any attempts to apply it to previously unseen datasets result in erroneous, sub-optimal outcomes.

 

Pivot Table

A table of statistics used in data processing that allows you to summarize data from a more extensive database and explore it dynamically. Pivot tables can be rearranged (or ‘pivoted’) to explore the data from various perspectives.

 

Predictive Analytics

The process of using data, statistical algorithms and machine learning techniques to predict future outcomes based on historical data. By knowing what has already happened, predictive analytics allows us to provide the best assessment of what will happen in the future.

 

Principal Component Analysis

Mainly used as a tool in exploratory data analysis, Principal Component Analysis (PCA) simplifies the complexity of high-dimensional data, usually by transforming the data into fewer dimensions while still retaining most of the information.

 

Probability Distribution

A mathematical function that provides the probabilities of occurrence of different possible outcomes of a statistical experiment.

 

Python

An interpreted, general-purpose, high-level programming language. Python is popular for use in data science, partly due to its power when working with specialized libraries such as those designed for machine learning and graph generation.

 

R

A statistical programming language and free software environment widely used by data scientists.

 

R-Squared

A statistical measure that determines the closeness between predicted values and actual values. In a regression model, the metric represents the proportion of the variance for a dependent variable that’s explained by one or more independent variables.
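
A minimal sketch of the calculation, assuming the common definition R² = 1 - SS_res / SS_tot (residual sum of squares over total sum of squares). The function name and the values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot


# Hypothetical actual values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
score = r_squared(actual, predicted)
print(round(score, 3))  # 0.975
```

A value of 1 means the predictions match the actual values exactly; a value of 0 means the model does no better than always predicting the mean.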

 

Random Forest

An ensemble learning method that works by combining many decision trees in a single model. By pooling predictions from multiple trees, a random forest will enable more robust predictions than from a single decision tree.

 

Recurrent Neural Network (RNN)

A class of artificial neural network containing loops that allow information to persist. It is designed to recognize the sequential characteristics of data and use patterns to predict the next likely scenario.

 

Regression

A statistical technique used to measure the strength of the relationship between a dependent variable and one or more independent variables.

 

Reinforcement Learning

A type of machine learning in which, rather than being given pre-defined target outputs, agents make decisions and learn from feedback (rewards or penalties) whether they are progressing as they go along.

 

Sigmoid Function

A mathematical function with an “S”-shaped curve or sigmoid curve, whereby the output approaches 1 as the input approaches positive infinity, and the output approaches 0 as the input approaches negative infinity. The term is often used to refer to the special case of the logistic function.
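
A minimal sketch of the logistic function, 1 / (1 + e^(-x)). The two-branch form is an implementation choice made here to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to stay numerically stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) is at most 1
    return z / (1.0 + z)


print(sigmoid(0))               # 0.5
print(round(sigmoid(10), 4))    # 1.0
print(round(sigmoid(-10), 4))   # 0.0
```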

 

Softmax

A function used in multi-class logistic regression that takes an un-normalized vector and normalizes it into a probability distribution.
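
A minimal sketch: each score is exponentiated and divided by the sum of all the exponentials. Subtracting the maximum score first is a common numerical-stability step that does not change the result; the input scores are made-up illustration values.

```python
import math


def softmax(scores):
    """Turn a list of un-normalized scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```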

 

Supervised Learning

A type of machine learning algorithm in which both the input data and the corresponding actual output data (labels) are provided. The model is then trained so that its predicted output is close to the actual output, in order to make similar predictions on unseen data going forward.

 

Testing Set

Data that is specifically used to evaluate the performance of a model which has been trained using a separate Training Set.

 

Training Set

Data that is specifically used to train a model, which can then be tested with the separate Testing Set.

 

Time Series Data

A form of data that measures how things change over time. As such, the data is indexed by time, and time itself is a primary axis.

 

Underfitting

When a model is not sufficiently complex to accurately capture the relationships between a dataset’s features and a target variable, and therefore does not work well on new data.

 

Unsupervised Learning

A machine learning technique in which the model is provided with only input data (and not output data, unlike in Supervised Learning). The model is then trained to learn patterns in the data, which can then be used to make predictions on unseen data.

 

Vector

An ordered set of real numbers, each denoting a distance on a coordinate axis. These numbers can represent a series of details about the specific entity being modeled.

 

Weight

Measures the strength of the connection between two neurons in two successive layers of a neural network.