When to use random forest

I understand Random Forest models can be used both for classification and regression situations. Is there a more specific criteria to determine where a random forest model would perform better than common regressions (Linear, Lasso, etc) to estimate values or Logistic Regression for classification?

How to find size of each tree in a forest?

I have a tree of n nodes. If I remove k edges of this tree, I will have k+1 new trees i.e. a forest of k+1 trees. How do I calculate the number of nodes in each of these new trees formed?

I need to do this for q queries.

My approach: I can remove the k edges from the original tree and run dfs on each nodes from which edge was deleted and find the size of the new tree. And then join the nodes back after getting size of each tree. But this is not optimal, can anyone suggest a better approach.

Sorting vertices by MST in a forest of MSTs

Consider a number of vertices. They are separated into a number of MSTs (so there’s a forest of MSTs) using Kruskal’s algorithm. For each vertex I need to know in which MST they are located. Each MST will be assigned a unique ID.

When completed I will know, for example, that Vertex #12 is in MST #2.

This is my best attempt at this. As you can see the algorithm failed because it falsely tagged one MST as two separate MSTs (Blue Group 0 and Blue Group 1):

enter image description here

I’m looking for an algorithm that can do this. Thanks in advance. I’m a programmer not a mathematician.

Is it possible that AD forest have different versions of AD on different sites?

As I plan upcoming AD upgrade (from 2008R2 to 2019) I face another problem: we’re actually have AD Forest that consists of several sites, each of them on 2008R2 AD version. As we upgrade one site AD version from 2008R2 to 2019, other sites will remain on 2008R2 for quite site time until we’ll see 2019 scheme works well and no problem appears. And now I doubt if the whole forest be functional and no problem arise when one site be on 2019 and other sites on 2008R2 AD scheme versions.

So the question is: in practice (that is, in real life) will the forest still be functional and usable as one site (and, eventually, one by one all other sites, too) be upgraded, while remaining sites still be on lower AD version? In general, the forest needed to have cross-sites auth, so no sophisticated features are needed, but anyway I’d better ask before jumping into the water 🙂

How can I validate/estimate the accuracy of the OOB error in Random Forest?

I might not exactly be at the right place here, but I’ll give it a go. I am using RF to approximate results from a simulation tool that I use since the simulations are quite computationally intensive. I am trying to develop a procedure to optimize the location of the points in my predictor space in order to use as little points as possible for my training data and thus run as little simulations as possible. To do this I am relying on the OOB error estimate that coms with ranger (a C++ implementation of RF).

My problem is that I have no idea how big my data set needs to be for the OOB error estimate to be even remotely accurate. I guess that someone somewhere must have tried assess the significance/accuracy of the OOB error estimate, but I was unable to find it.

Does anyone know of some literature, blog article, pre-coded method that can help me achieve this goal?

Random Forest – Conditional Permutation Importance

I’ve been looking for the most unbiased algorithm to find out the feature importances in random forests if there are correlations among the input features.

Besides the most commonly preferred methodologies; gini-impurity reduction, drop-column importance and permutation importance, I found an algorithm called conditional permutation importance, in the given article: (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307#Sec8)

The steps for calculating the conditional permutation importance are given in the article like this:

  1. In each tree compute the oob-prediction accuracy before the permutation
  2. For all variables Z to be conditioned on: Extract the cutpoints that split this variable in the current tree and create a grid by means of bisecting the sample space in each cutpoint.
  3. Within this grid permute the values of X j and compute the oob-prediction accuracy after permutation
  4. The difference between the prediction accuracy before and after the permutation accuracy again gives the importance of X j for one tree. The importance of X j for the forest is again computed as an average over all trees.

For the first step, I’m having difficulties to reach oob scores of each tree as the default oob_score is calculated for all trees in the forest in scikit’s methods. However, since I can still reach single trees as decision trees, I tried test inputs in these trees instead of oob samples but the kernel kept dying…

  • Sample RF Classiffier

    clf=RandomForestClassifier(n_estimators=200,max_depth=3,oob_score = True) forest = clf.fit(train_inputs_arr, train_targets_arr)

  • Reaching single decision tree properties

forest[0].tree_.predict(test_inputs)

forest[0].tree_.predict(test_inputs) 

For the second step, I’m having difficulty to understand what is meant by “creating a gird by means of bisecting the sample space at each cutpoint”, and didn’t really understand if I should determine the cutpoints of the selected Xj or for the other variables Z to be conditioned on.

Additionally, I’m also sharing the permutation importance method structure that I previously used, It simply permutes every feature calculates how the oob score decreases for each feature after permutation and the highest decrease in the oob score means higher feature importance. I wanted to modify this structure but I’m theoretically stuck at this point. What I really want to learn is any implementation of this algorithm on python.

def permutation_importances(rf, x_tr, y_train): rf.fit(x_tr,y_train) baseline = rf.oob_score_ imp = [] for col in x_tr.columns:     rf_ = rf     save = x_tr[col]     x_tr.loc[:,col] = np.random.permutation(save)     rf_.fit(x_tr, y_train)     m = rf_.oob_score_     x_tr.loc[:,col] = save     imp.append(baseline - m) 

Forest Fire Cellular Automata

enter image description here

I stumbled on the idea of a Forest Fire simulating cellular automata, and decided to try making a version using a full Seesaw UI (a Clojure wrapper over Java’s Swing). A short sample of it running is above.

It ended up being fairly straightforward; although I’ve written a fair number of CA’s before, so this was more just applying a specific rule-set, and designing a quick UI.

It can be used in the console too using format-world. T‘s are trees, and #s are fire squares:

(let [r (g/new-rand-gen 99)       w (new-empty-world [10 10] (new-chance-settings 0.3 0.1))        worlds (->> w                   (iterate #(advance % r))                   (take 5))]    (doseq [world worlds]     (println (format-world world) "\n")))                    T             T         T                 T T T   T                                   T               T T T     T                 T     T   T           T                         T         T   T   T     T   T     T T T   T     #           T     T T T   T   T                 T T     T     T T             T T T     T T T             T T   T   T     T     T T T     T   T   T T   T T   T     #   T   T   T T   #     # T T   T T T     T       T     # # #   #   T     T   T T     T T     T     T T T     T     T T T T   T T T       T     T T T T T T     T   T T T T T   T T # T T T T T T T T         #   T T T T           # T   T # #     #     T T           T     T     #   # #     # T T T T   T T T T   T T T   T T T T T T # T   T   #     T # T T T T     #   T T # T T T T #   # # T T T T T T  

I’d like any general suggestions. I’m not very organized, so my UI tends to become a mess. Any suggestions for improvements there especially would be appreciated.


Helper functions for dealing with 2D vectors

(ns forest-fire.grid   (:require [clojure.string :as s]))  (defn new-grid [dimensions initial-cell]   (let [[w h] dimensions]     (vec (repeat h                  (vec (repeat w initial-cell))))))  (defn set-cell [grid position new-cell]   (let [[x y] position]     (assoc-in grid [y x] new-cell)))  (defn get-cell [grid position]   (let [[x y] position]     (get-in grid [y x])))  (defn dimensions-of [grid]   [(count (first grid))    (count grid)])  (defn positions-of   "Returns a lazy list of every integer coordinate in the grid."   [grid]   (let [[w h] (dimensions-of grid)]     (for [y (range h)           x (range w)]       [x y])))  (defn format-grid   "Gives each cell to str-map to decide what str is used in formating.   Uses \" \" if str-map returns nil"   [grid str-map]   (->> grid        (map (fn [row] (map #(or (str-map %) " ") row)))        (map #(s/join " " %))        (s/join "\n"))) 

An abridged part of my personal library used in the project

(ns helpers.general-helpers)  (defn new-rand-gen   ([seed] (Random. seed))   ([] (Random.)))  (defn random-perc   "Returns true perc-chance percent of the time."   [^double perc-chance, ^Random rand-gen]   (<= (.nextDouble rand-gen)       perc-chance))  (defn map-range   "Maps a value from the range [start1 stop1] to a value in the range [start2 stop2]."   [value start1 stop1 start2 stop2]   (+ start2      (* (- stop2 start2)         (/ (- value start1)            (- stop1 start1)))))  (defn parse-double   "Returns nil on bad input."   [str-n]   (try     (Double/parseDouble str-n)      (catch NumberFormatException _       nil))) 

The main state of the simulation

(ns forest-fire.world   (:require [forest-fire.grid :as gr]             [helpers.general-helpers :as g]))  (def cell-types #{::empty, ::tree, ::burning})  (defn new-chance-settings [grow-chance burn-chance]   {:grow-chance grow-chance, :burn-chance burn-chance})  (defn new-empty-world [dimensions settings]   {:grid (gr/new-grid dimensions ::empty)    :settings settings})  (defn dimensions-of [world]   (-> world :grid (gr/dimensions-of)))  (defn- positions-around   "Returns the coordinates around a cell representing the Moore neighborhood of the cell (when (= depth 1))."   ([position dimensions depth]    (let [[cx cy] position          [w h] dimensions]       (for [y (range (max 0 (- cy depth)) (min h (+ cy depth 1)))            x (range (max 0 (- cx depth)) (min w (+ cx depth 1)))            :let [neigh-pos [x y]]            :when (not= neigh-pos position)]         neigh-pos)))    ([position dimensions]    (positions-around position dimensions 1)))  (defn- neighbors-of   "Returns the cells in the Moore neighborhood of a cell."   [grid position]   (let [dims (gr/dimensions-of grid)]     (->> (positions-around position dims)          (map #(gr/get-cell grid %)))))  (defn- has-burning-neighbor? [grid position]   (->> (neighbors-of grid position)        (some #(= % ::burning))))  (defn- advance-cell [new-grid old-grid position chance-settings rand-gen]   (let [contents (gr/get-cell old-grid position)         {:keys [grow-chance burn-chance]} chance-settings]      (case contents       ::burning (gr/set-cell new-grid position ::empty)        ::empty (if (g/random-perc grow-chance rand-gen)                 (gr/set-cell new-grid position ::tree)                 new-grid)        ::tree (cond                ; Doing this check first since it'll be much cheaper                (g/random-perc burn-chance rand-gen)                (gr/set-cell new-grid position ::burning)                 (has-burning-neighbor? old-grid position)                (gr/set-cell new-grid position ::burning)                 :else                new-grid))))  (defn- advance-grid [grid chance-settings rand-gen]   (reduce (fn [acc-grid pos]             (advance-cell acc-grid grid pos chance-settings rand-gen))           grid           (gr/positions-of grid)))  (defn advance   "Advances the forest fire world by 1 \"tick\""   [world rand-gen]   (let [{:keys [settings]} world]     (update world :grid advance-grid settings rand-gen)))   (defn format-world   "Formats the world into a String to be printed."   [world]   (let [format-f {::empty " ", ::burning "#", ::tree "T"}]     (gr/format-grid (:grid world) format-f))) 

The main UI

(ns forest-fire.seesaw-ui   (:require [seesaw.core :as sc]             [seesaw.graphics :as sg]              [forest-fire.world :as fw]             [forest-fire.grid :as gr]              [helpers.general-helpers :as g])    (:import [java.awt Component]            [javax.swing Timer]))  (def settings-font {:name "Arial", :size 40})  (def global-rand-gen (g/new-rand-gen 99))  (def default-advance-delay 100)  (defn component-dimensions [^Component c]   [(.getWidth c), (.getHeight c)])  (defn world->canvas-position [world-position world-dimensions canvas-dimensions]   (let [[x y] world-position         [ww wh] world-dimensions         min-canvas-dim (apply min canvas-dimensions)]     [(g/map-range x, 0 ww, 0 min-canvas-dim)      (g/map-range y, 0 wh, 0 min-canvas-dim)]))  (defn square-size [world-dimensions canvas-dimensions]   (double (/ (apply min canvas-dimensions)              (apply min world-dimensions))))  (defn paint [world-atom canvas g]   (let [{:keys [grid] :as world} @world-atom          world-dims (fw/dimensions-of world)         canvas-dims (component-dimensions canvas)         sq-width (square-size world-dims canvas-dims)          w->c #(world->canvas-position % world-dims canvas-dims)]      (doseq [world-pos (gr/positions-of grid)]       (let [cell (gr/get-cell grid world-pos)]         (when-not (= cell ::fw/empty)           (let [[cx cy] (w->c world-pos)                 col (case cell                       ::fw/tree :green                       ::fw/burning :red)]             (sg/draw g                (sg/rect cx cy sq-width sq-width)                (sg/style :background col, :foreground :black))))))))  (defn new-canvas [state-atom]   (let [canvas (sc/canvas :paint (partial paint state-atom),                           :id :canvas)]     canvas))  (defn new-input-pair [label-text settings-key state-atom]   (let [current-world @state-atom         initial-value (get-in current-world [:settings settings-key])          label (sc/label :text (str label-text),                         :font settings-font)          input (sc/text :text (str initial-value),                        :font settings-font)]      (sc/listen input        :key-released        (fn [_]          (when-let [parsed (g/parse-double (sc/text input))]            (swap! state-atom                   assoc-in [:settings settings-key] parsed))))      (sc/horizontal-panel :items [label input])))  (defn new-main-panel [state-atom]   (let [canvas (new-canvas state-atom)         settings-bar (sc/horizontal-panel                        :items [(new-input-pair "Burn Chance" :burn-chance state-atom)                                (new-input-pair "Grow Chance" :grow-chance state-atom)])]      (sc/border-panel :center canvas,                      :south settings-bar)))  (defn world-advancer   "Advances the world and repaints the canvas every advance-delay ms.   Returns a 0-arity function that stops the timer when called."   [state-atom canvas advance-delay]   (let [t (sc/timer             (fn [_]               (swap! state-atom fw/advance global-rand-gen)               (sc/repaint! canvas))             :delay advance-delay)]      (fn [] (.stop ^Timer t))))  (defn new-frame [starting-world]   (let [state-atom (atom starting-world)          main-panel (new-main-panel state-atom)         canvas (sc/select main-panel [:#canvas])         frame (sc/frame :content main-panel, :size [1000 :by 1000])          stop-advancer (world-advancer state-atom canvas default-advance-delay)]      (sc/listen frame        :window-closing (fn [_]                          (stop-advancer)))      frame))  (defn -main []   (let [w (-> (fw/new-empty-world [100 100] (fw/new-chance-settings 0.9 0.0001))               (fw/advance global-rand-gen)               (fw/advance global-rand-gen))]      (-> (new-frame w)         (sc/show!))))