Please read the whole question before answering; the exact details of the implementation are important.
Suppose that you want to find a largest-cardinality bipartite matching in a bipartite graph with $ V = L + R$ vertices ($ L$ is the number of vertices on the left-hand side and $ R$ is the number of vertices on the right-hand side) and $ E$ edges. You may assume that the graph is connected, therefore $ E \geqslant V - 1$ .
Vertices on the left-hand side are numbered with integers from the range $ [0, L)$ . Similarly, vertices on the right-hand side are numbered with integers from the range $ [0, R)$ . Then, the classic implementation of Kuhn's bipartite matching algorithm looks like this:
```
bool dfs_Kuhn(v, neigh, used, left_match, right_match):
    if used[v]
        return false
    used[v] = true
    for dest in neigh[v]
        if right_match[dest] == -1 or dfs_Kuhn(right_match[dest], neigh, used, left_match, right_match)
            left_match[v] = dest
            right_match[dest] = v
            return true
    return false

int bipartite_matching_size(neigh):
    left_match = [-1 repeated L times]
    right_match = [-1 repeated R times]
    for i in [0, L)
        used = [false repeated L times]
        dfs_Kuhn(i, neigh, used, left_match, right_match)
    return L - (number of occurrences of -1 in left_match)
```
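For concreteness, here is a direct Python transcription of the pseudocode above (a sketch: names mirror the pseudocode, and `R` is passed explicitly since it cannot be recovered from `neigh`):

```python
def dfs_kuhn(v, neigh, used, left_match, right_match):
    # Try to find an augmenting path starting from left vertex v.
    if used[v]:
        return False
    used[v] = True
    for dest in neigh[v]:
        # Augment if dest is free, or if its current partner
        # can be rematched to some other right vertex.
        if right_match[dest] == -1 or dfs_kuhn(right_match[dest], neigh,
                                               used, left_match, right_match):
            left_match[v] = dest
            right_match[dest] = v
            return True
    return False


def bipartite_matching_size(neigh, R):
    L = len(neigh)
    left_match = [-1] * L    # -1 marks an unmatched vertex
    right_match = [-1] * R
    for i in range(L):
        used = [False] * L
        dfs_kuhn(i, neigh, used, left_match, right_match)
    return L - left_match.count(-1)
```

Note that on large inputs the recursive DFS may need an increased recursion limit (or an iterative rewrite); the sketch keeps the recursion for readability.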
This implementation works in $ O(VE)$ time; moreover, the bound is tight more or less independently of the relation between $ V$ and $ E$ . In other words, the bound is tight for sparse graphs ($ E = O(V)$ ), for dense graphs ($ E = \Omega(V^2)$ ) and for everything in between.
There is an implementation that works much faster in practice. The $ \texttt{dfs_Kuhn}$ function does not change, but $ \texttt{bipartite_matching_size}$ changes:
```
int bipartite_matching_size_fast(neigh):
    left_match = [-1 repeated L times]
    right_match = [-1 repeated R times]
    shuffle(neigh)
    for row in neigh
        shuffle(row)
    while true
        used = [false repeated L times]
        found_path = false
        for i in [0, L)
            if left_match[i] == -1
                found_path |= dfs_Kuhn(i, neigh, used, left_match, right_match)
        if !found_path
            break
    return L - (number of occurrences of -1 in left_match)
```
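A Python sketch of the fast variant, with one deliberate deviation: the pseudocode shuffles `neigh` itself, which amounts to relabelling the left vertices; the sketch instead shuffles the processing order, which has the same effect on the algorithm's behaviour without invalidating the indices in `left_match`. The DFS helper is repeated so the snippet is self-contained.

```python
import random


def dfs_kuhn(v, neigh, used, left_match, right_match):
    # Same augmenting-path DFS as in the classic version.
    if used[v]:
        return False
    used[v] = True
    for dest in neigh[v]:
        if right_match[dest] == -1 or dfs_kuhn(right_match[dest], neigh,
                                               used, left_match, right_match):
            left_match[v] = dest
            right_match[dest] = v
            return True
    return False


def bipartite_matching_size_fast(neigh, R, rng=random):
    L = len(neigh)
    left_match = [-1] * L
    right_match = [-1] * R
    # Random order of left vertices (stands in for shuffle(neigh)).
    order = list(range(L))
    rng.shuffle(order)
    for row in neigh:
        rng.shuffle(row)  # random neighbour order (second optimisation)
    while True:
        used = [False] * L
        found_path = False
        for i in order:
            if left_match[i] == -1:
                # |= : one phase may augment along several disjoint paths
                found_path |= dfs_kuhn(i, neigh, used, left_match, right_match)
        if not found_path:
            break
    return L - left_match.count(-1)
```

Whatever the random choices, the `while` loop only stops once a full phase finds no augmenting path, so the returned value is always the maximum matching size; only the running time is random.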
Of course, the upper bound of $ O(VE)$ can be proven for the faster version as well. Lower bounds are a completely different story, though.
We used two optimisations:

The block of code inside $ \texttt{while true}$ works in $ O(E)$ total time, but often finds several disjoint augmenting paths, instead of at most one, as did the block inside $ \texttt{for i in [0, L)}$ in the original code.

The order of the vertices on the left-hand side, and the order in which the for-loop $ \texttt{for dest in neigh[v]}$ considers their neighbours, are now random.
If only the first of these two optimisations is used, there are some relatively well-known degenerate cases in which the code still takes $ \Omega(VE)$ time. However, almost all such cases that I know abuse a specific ordering of the neighbours of the left-hand side vertices, so the $ \texttt{dfs_Kuhn}$ function is forced to repeatedly walk along some fixed, very long path and "flip" it. Therefore, they fall apart when the second optimisation is added.
The only semi-strong test I know is a dense ($ E = \Theta(V^2)$ ) graph, on which the fast version of Kuhn's algorithm takes $ \Omega(V^3 / \log V)$ time. However, all my attempts to generalise that construction to sparse graphs (the case I am most interested in) were unsuccessful.
So, I want to ask the following question. Is anything known about the runtime of this fast version of Kuhn's algorithm on sparse graphs? Any nontrivial lower bounds (better than $ \Theta(E \cdot \log V)$ )? Maybe some upper bounds (a friend of mine believes that this algorithm always runs in $ O(E \sqrt{E})$ time, which seems to be the case on random bipartite graphs)?