Uses a linear function to divide each set into two subsets
- For 3+ classes: fit N-1 lines (N classes)
If the nearest instance to the previously unseen instance is A, then the class is also A, otherwise it is B.
- the decision surface of a NN classifier (also called Theissen regions) is implicit and divide the space into regions "belonging" to each instance and corresponding class.
Generalization of the nearest neighbor algorithm
- Find the nearest K (typically chosen to be an odd number) instances
- Each one represents a vote
- Using more attributes can sometimes lead to getting the wrong classification for an instance, when using less gave the right one;
- Nevertheless, using some subsets can also lead to wrong classifications.
- Simple to implement
- Handles correlated features
- Defined for any distance measure
- Handles streaming data trivially
- Very sensitive to irrelevant features
- Slow classification time for large datasets
- Work best for real valued datasets
- Prepruning: halt tree construction early
- do not split a node if this would result in the goodness measure falling below a threshold
- Postpruning: get a sequence of progressively pruned tree; use a set of data different from the training data to decide which is the "best pruned tree"
- Easy to understand
- Easy to generate rules
- Overfitting
- Does not handle correlated features very well (classified by rectangular partitioning)
- Can be quite large (pruning is necessary)
Use Bayes theorem, which says:
- The probability of an instance being in a certain class is the probability of that instance being generated by that class times the probability of occurence of that class, divided by the probability of that instance existing.
- Assuming independent distributions, the probability of a certain class generating the instance is the multiplication of the probability of that class generating the observed value for each of the features of that instance.
- Fast to train (single scan) and to classify
- Not sensitive to irrelevant features
- Handles real and discrete data
- Handles streaming data well
- Assumes independence of features
- Set of neurons (input/output units) connected with weights
- Supervised learning adjusts weights to ensure outputs to given inputs are the expected ones
- Predicting: feed input values; collect outputs
- Universal: fit any continuous function
- Versatile: output may be one or more discrete and real values
- Online: application and learning are intertwined
- Robust to errors and noise data
- Fast application to new examples
- Parallel
- Slow training
- Low usability: empirical parameter tuning; network topology and learning rate
- Low Interpretability: understand the weights
- Low Adaptability: not easy to incorporate domain knowledge
- Linear learning machines with maximisation of margin (better separation between classes);
- Hyperplane farthest from both classes has less risk of overfitting.
- Higher robustness to the curse of dimensionality;
- For non-linear functions, map attributes to space where linear discrimination is possible.
- Minimize the tube "around" the data;
- Instead of maximizing the distance to the closest examples from each class.