A/B Testing
The practice of testing two variants of a single variable (e.g. a webpage) to determine which one performs better.
Activation Function
A function that maps an input signal of a node in an Artificial Neural Network to an output signal. This output is then used as the input for the next node
Artificial Intelligence
An area of computer science that focuses on the creation of intelligent machines that operate and react much like humans do. Such machines process inputs from the environment, using the experience they have gained to make intelligent decisions.
Autoencoders
A neural network trained with an unsupervised learning algorithm in which the output target values are set equal to the inputs. The autoencoder’s goal is to reduce the number of random variables under consideration, such that the input is represented in fewer dimensions (see Dimension Reduction).
Backpropagation
Refers to the “backward propagation of errors”. A technique used in artificial neural networks to calculate a gradient for updating the weights and biases in the network. The error is computed at the output and distributed backwards throughout the network’s layers.
Bayes’ Theorem
A mathematical formula for determining conditional probabilities: P(A|B) = P(B|A) * P(A) / P(B). The formula gives the probability of an event when we already have information about some other condition related to the event.
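As a small worked illustration of Bayes’ Theorem, the sketch below computes P(A|B) = P(B|A) * P(A) / P(B) for a screening test. The probabilities are invented for the example and the function name bayes_posterior is hypothetical.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if prob_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / prob_b


# Hypothetical numbers: 1% of people have a condition (event A).
# A test is positive (event B) for 90% of people with the condition
# and for 5% of people without it.
p_a = 0.01
p_b_given_a = 0.90
p_b_given_not_a = 0.05

# Total probability of a positive test, P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_a, p_b_given_a, p_b)
print(round(posterior, 3))  # 0.154
```

Even with a fairly accurate test, the probability of having the condition given a positive result stays low here because the condition itself is rare.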
Bayesian Network
A probabilistic graphical model for representing multivariate probability distributions. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given the symptoms, the network can be used to work out the probabilities of the presence of various diseases. Also known as “belief networks” or “causal networks”
Bias
In machine learning, bias occurs when results are systematically prejudiced because of errors in the machine learning process that cause the learner to consistently learn the same wrong thing.
Big Data
Collections of data (both structured and unstructured) that were once deemed impractical to work with because their volume, velocity, and variety meant they could not be processed by traditional processing applications. Big Data can now be processed and analyzed to generate insights and support better decisions.
Chi-Square Test
A statistical hypothesis test to determine if two categorical variables are related in a population.
Classification
The process of determining the class of given data points. Classes are sometimes called targets, labels or categories.
Clustering
The process of organizing data into groups such that all members of each group have some common characteristics.
Confusion Matrix
A confusion matrix is a table that summarizes the performance of a classification algorithm on a set of test data for which the true values are known. The matrix uses the counts of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) to evaluate the performance.
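The four counts can be tallied directly from paired lists of true and predicted labels. The following is a minimal sketch for a binary classifier; the label lists are made-up data and confusion_counts is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, truly positive
        elif p == positive:
            fp += 1  # predicted positive, truly negative
        elif a == positive:
            fn += 1  # predicted negative, truly positive
        else:
            tn += 1  # predicted negative, truly negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels for eight test examples.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts)    # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
print(accuracy)  # 0.75
```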
Control Set
When using cross-validation to test a learning model, the model is trained on the training set, and then its performance is tested on the control set (also known as the ‘test set’).
Continuous Variable
A variable that can take on any value between its defined minimum and maximum values.
Cross-Validation
The process of dividing available data into a test set and a training set. The model is trained on the training set and then applied to the test set (or ‘control set’) to measure the performance of the learned model. Cross-validation therefore helps us to evaluate how well a model makes predictions on unseen data.
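The definition above describes a single split. A common variant, k-fold cross-validation, rotates the test set so that every data point is used for testing exactly once. The sketch below only generates the index splits, and k_fold_indices is a hypothetical helper name. In practice the data is usually shuffled first, which is omitted here to keep the example deterministic.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k consecutive folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained and scored once per fold, and the scores averaged to estimate performance on unseen data.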
Data Mining
The process of finding patterns, relationships, correlations and anomalies within large data sets to predict outcomes.
Data Science
The interdisciplinary field of using tools and techniques to extract knowledge and insights from both structured and unstructured data.
Data Structure
A particular organization of units of data that enables efficient access and modification, such as an array or a tree.
Data Wrangling
The process of preparing complex data for easy access and analysis. It can involve cleaning the data, unifying the data, converting the format of the data and combining the data for easier organization
Decision Boundary
A boundary that separates the elements of one class from those of another class.
Decision Tree
A flowchart-like tree structure modelling a supervised learning algorithm. A decision tree represents a number of possible decision paths and an outcome for each path, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a decision.
Deep Learning
A subset of machine learning that uses multi-layered neural networks to accelerate the learning process for computers. Deep learning is a key technology behind driverless cars.
Dependent Variable
A variable under test that changes when changes are made to an independent variable.
Dimension Reduction
The process of reducing the dimensional representation of the data without losing much important information. When datasets have more than 3 dimensions, dimension reduction techniques prove particularly useful. Principal Component Analysis is one such technique.
Discrete Variable
A variable that can only take one of a specific number of values.
Exploratory Data Analysis
An analytical approach used for summarizing and visualizing data, and for discovering insights from data
Gradient Boosting
A machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
Gradient Descent
An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. To maximize a function instead, the algorithm would move in the direction of steepest ascent (the positive of the gradient), i.e. the direction that causes the function to increase the most. The step is repeated from each new point.
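A minimal sketch of the update rule for a function of one variable is shown below. The example function f(x) = (x - 3)^2, the starting point, the step size and the number of steps are all arbitrary choices made for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move in the direction of steepest descent
    return x


# Example function: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

With these settings the distance to the minimum shrinks by a factor of 0.8 at every step, so the result ends up very close to 3.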
Graphics Processing Unit (GPU)
A computer chip that performs complex mathematical calculations rapidly, mainly for the purposes of image rendering. Often used by deep learning models, which require large processing power
Hyperparameter
A parameter of a model that cannot be learnt from the data and is therefore determined by experimenting with different values to work out which is most suitable.
Independent Variable
A variable that, when changed, causes a change in a dependent variable.
K-Means Clustering
A type of unsupervised learning algorithm for use with unlabeled data (without defined categories or groups). The aim is for the algorithm to partition the data set into a certain number of clusters (represented by the variable K).
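The usual procedure alternates between assigning each point to its nearest cluster centre and moving each centre to the mean of its assigned points. The sketch below does this for one-dimensional data with K = 2. The data, the starting centroids and the function name k_means_1d are assumptions made for the example; real implementations typically choose starting centroids at random and stop once assignments no longer change.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```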
K-Nearest Neighbors
A supervised learning algorithm whose goal is to use a database in which similar data points are separated into groups to predict the classification of a new data point. The new point is assigned the class that is most common among its K closest neighbors.
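A minimal sketch of the idea, using Euclidean distance and a majority vote among the K closest labeled points, might look like this. The training points, the labels and the function name knn_predict are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest points.

    `training` is a list of (point, label) pairs, where each point is a
    tuple of coordinates.
    """
    by_distance = sorted(training,
                         key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up labeled data: two well-separated groups.
training = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]

print(knn_predict(training, (2, 2)))  # red
print(knn_predict(training, (6, 5)))  # blue
```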
Labeling
The process of adding meaningful tags to data. Models can use labeled datasets to predict labels for unseen, unlabeled data
Latent Variable
A variable that cannot be observed directly, but rather is inferred using models from observed data. For example, if answers in an IQ test are observed data, then an estimate of the IQ is the latent variable.
Learning Rate
A scalar that’s applied to the gradient in Gradient Descent algorithms to determine the next point in relation to the previous point. The learning rate is a Hyperparameter: if picked too low, learning will take too long; if picked too high, the model may never converge. As such, selecting a suitable learning rate is important.
Linear Regression
A regression technique that is used to determine the linear relationship between a dependent variable and one or more independent variables.
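For the case of a single independent variable, the line of best fit can be computed in closed form using ordinary least squares. The sketch below assumes that approach; the data points and the function name fit_line are made up for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```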
Logistic Regression
A statistical regression model that is applied when the dependent variable is binary. As such, it explains the relationship between one dependent binary variable and one or more independent variables.
Machine Learning
A branch of artificial intelligence. Machine learning involves analyzing data to learn and generate analytical models that can perform intelligent actions on unseen data, with minimal human intervention.
Markov Chain
A stochastic model used to model random processes. It describes a sequence of possible events where the probability of each event depends solely on the outcome of the previous event.
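As an illustration, the sketch below simulates a two-state chain in which the next state is drawn using only the current state. The weather states and transition probabilities are invented for the example.

```python
import random

# Hypothetical transition probabilities: for each current state,
# the probability of each possible next state.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def simulate(start, steps, rng):
    """Return the list of states visited, beginning with `start`."""
    state = start
    path = [state]
    for _ in range(steps):
        options = transitions[state]
        # The draw depends only on the current state, not on earlier history.
        state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path


path = simulate("sunny", 5, random.Random(0))
print(path)
```

Passing a seeded random.Random instance keeps the simulation reproducible between runs.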
Model Fitting
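The process of adjusting a model’s parameters so that its outputs match a data set as closely as possible, and of measuring how well the resulting model performs against that data.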
Model Selection
The process of selecting the appropriate statistical model from a wider set of possible models, given data
Model Tuning
The process of adjusting parameters of a given model in order to improve its overall performance
Natural Language Processing (NLP)
A sub-field of artificial intelligence that studies how computers can interact with humans based on their understanding of natural language as spoken by humans. NLP aims to bring computers’ understanding of language as close as possible to that of humans.
Neural Network
Modelled on the human brain, a neural network is a set of algorithms designed to recognize patterns. Neural networks take arbitrary inputs, process them and use the activation function to generate outputs. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.
Outlier
Extreme values in data representing unusual observations, errors in measurement and recording, or accurate reporting of rare events
Overfitting
When a model has been excessively trained on specific data, such that it becomes inapplicable to other datasets. The model is so specific to the original data that any attempt to apply it to previously unseen datasets results in erroneous, sub-optimal outcomes.
Pivot Table
A table of statistics used in data processing that allows you to summarize data from a more extensive database and explore it dynamically. Pivot tables can be rearranged (or ‘pivoted’) to explore the data from various perspectives
Predictive Analytics
The process of using data, statistical algorithms and machine learning techniques to predict future outcomes based on historical data. By knowing what has already happened, predictive analytics allows us to provide the best assessment of what will happen in the future.
Principal Component Analysis
Mainly used as a tool in exploratory data analysis, Principal Component Analysis (PCA) simplifies the complexity of high-dimensional data, usually by transforming the data into fewer dimensions while still retaining most of the information.
Probability Distribution
A mathematical function that provides the probabilities of occurrence of different possible outcomes of a statistical experiment.
Python
An interpreted, general purpose and high-level programming language. Python is popular for use in data science, partly due to its power when working with specialized libraries such as those designed for machine learning and graph generation.
R
A statistical programming language and free software environment widely used by data scientists
R-Squared
A statistical measure that determines the closeness between predicted values and actual values. In a regression model, the metric represents the proportion of the variance for a dependent variable that’s explained by one or more independent variables.
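A common way to compute the metric is R-squared = 1 - SS_res / SS_tot, where SS_res is the sum of squared differences between actual and predicted values and SS_tot is the sum of squared differences between actual values and their mean. The sketch below assumes that definition; the data is made up.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values: the predictions are close to, but not exactly, the actuals.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

score = r_squared(actual, predicted)
print(round(score, 3))  # 0.975
```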
Random Forest
An ensemble learning method that works by combining many decision trees in a single model. By pooling predictions from multiple trees, a random forest will enable more robust predictions than from a single decision tree.
Recurrent Neural Network (RNN)
A class of artificial neural network containing loops that allow information to persist. It is designed to recognize the sequential characteristics of data and use patterns to predict the next likely scenario.
Regression
A statistical technique used to measure the strength of the relationship between a dependent variable and one or more independent variables
Reinforcement Learning
A type of machine learning in which, rather than being trained towards pre-defined target outputs, agents take decisions and learn from feedback (rewards or penalties) whether they are progressing or not as they go along.
Sigmoid Function
A mathematical function with an “S”-shaped curve (a sigmoid curve), whereby the output approaches 1 as the input approaches positive infinity, and approaches 0 as the input approaches negative infinity. The term often refers to the special case of the logistic function.
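For the logistic function, sigmoid(x) = 1 / (1 + e^(-x)), a minimal sketch is shown below. The two-branch form is one way to avoid overflow for large negative inputs and is a design choice of this example rather than part of the definition.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x) so that
    # math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0))    # 0.5
print(sigmoid(20))   # very close to 1
print(sigmoid(-20))  # very close to 0
```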
Softmax
A function used in multi-class logistic regression that takes an un-normalized vector and normalizes it into a probability distribution.
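A minimal sketch of the calculation is shown below: each score is exponentiated and divided by the sum of all exponentials. Subtracting the maximum score first is a common numerical-stability step that does not change the result. The input scores are arbitrary example values.

```python
import math


def softmax(scores):
    """Turn a list of un-normalized scores into probabilities that sum to 1."""
    largest = max(scores)
    # Shifting by the maximum avoids overflow without changing the output.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

Larger scores receive larger probabilities, and the outputs always sum to 1 (up to floating-point rounding).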
Supervised Learning
A type of machine learning in which both input data and the corresponding actual output data (labels) are provided. The model is then trained so that its predicted output is close to the actual output, in order to make similar predictions on unseen data going forward.
Testing Set
Data that is specifically used to evaluate the performance of a model which has been trained using a separate Training Set
Training Set
Data that is specifically used to train a model, which can then be tested with the separate Testing Set
Time Series Data
A form of data that measures how things change over time. As such, the data is indexed by time, and time itself is a primary axis.
Underfitting
When a model is not sufficiently complex to accurately capture the relationships between a dataset’s features and a target variable, and therefore does not work well on new data
Unsupervised Learning
A machine learning technique of providing the model with only input data (and NOT output data, unlike in the case of Supervised Learning). The model is then trained to learn patterns in the data, to then be used to make predictions on unseen data
Vector
An ordered set of real numbers, each denoting a distance on a coordinate axis. These numbers can represent a series of details about the specific entity being modeled
Weight
Measures the strength of the connection between two neurons in two successive layers of a neural network.