I have a dataframe with 3 columns: latitude, longitude and median_income. For each observation, I need to compute the average median income of all points within x km of that point and store it in a 4th column.
I have written 3 functions and call them with apply, hoping that would be fast. However, the dataframes take forever to process (hours). I haven’t seen an error yet, so it appears to be working okay.
I found the Haversine formula on here. I am using it to calculate the distance between 2 points from their lat/lon.

    from math import radians, cos, sin, asin, sqrt

    def haversine(lon1, lat1, lon2, lat2):
        # Calculate the great circle distance between two points
        # on the earth (specified in decimal degrees)
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        r = 6371  # Radius of earth in kilometers. Use 3956 for miles
        return c * r

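As a quick sanity check of the formula, here is a minimal self-contained sketch that repeats the function and evaluates it on two hypothetical points chosen only for illustration (same longitude, one degree of latitude apart). One degree of latitude should come out at roughly 111 km with this radius.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # 6371 km: mean Earth radius

# Hypothetical test points: same longitude, one degree of latitude apart.
d = haversine(-122.0, 37.0, -122.0, 38.0)
print(round(d, 1))  # one degree of latitude is roughly 111 km
```

Note the argument order: longitude comes before latitude, which is easy to swap by accident.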
My hav_checker function computes the haversine distance between one row and a given point. Applied across the dataframe, it gives the distance from the current row to every other row as a column.

    def hav_checker(row, lon, lat):
        hav = haversine(row['longitude'], row['latitude'], lon, lat)
        return hav

My value_grabber function uses the distance column built with hav_checker to return the mean of my target column (median_income) over all rows within the threshold.
For reference, I am using the California housing dataset to build this out.

    def value_grabber(row, frame, threshold, target_col):
        frame = frame.copy()
        frame['hav'] = frame.apply(hav_checker, lon = row['longitude'], lat = row['latitude'], axis=1)
        mean_tar = frame.loc[frame.loc[:,'hav'] <= threshold, target_col].mean()
        return mean_tar

I am trying to add these 3 columns to my original dataframe for a feature engineering project within a larger class project.

    df['MedianIncomeWithin3KM'] = df.apply(value_grabber, frame=df, threshold=3, target_col='median_income', axis=1)
    df['MedianIncomeWithin1KM'] = df.apply(value_grabber, frame=df, threshold=1, target_col='median_income', axis=1)
    df['MedianIncomeWithinHalfKM'] = df.apply(value_grabber, frame=df, threshold=.5, target_col='median_income', axis=1)

I have been able to do this successfully with looping, but it is extremely time intensive and I need a faster solution.
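For illustration, the same computation can be written with NumPy broadcasting instead of nested apply calls. This is only a sketch under assumptions, not a benchmarked solution: the function name `mean_within_radius` and the `chunk` parameter are made up here, the target values are assumed to contain no NaN, and the toy coordinates are hypothetical. It processes rows in blocks so that the full pairwise distance matrix never has to sit in memory at once.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def mean_within_radius(lat, lon, values, radius_km, chunk=1024):
    """For each point, mean of `values` over all points within radius_km (itself included)."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    values = np.asarray(values, dtype=float)
    out = np.empty(len(values))
    for start in range(0, len(values), chunk):
        stop = start + chunk
        # Broadcast a block of rows against every point: shape (block, n).
        dlat = lat[start:stop, None] - lat[None, :]
        dlon = lon[start:stop, None] - lon[None, :]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
        dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        mask = dist <= radius_km
        # Each row matches at least itself, so the count is never zero.
        out[start:stop] = (mask * values).sum(axis=1) / mask.sum(axis=1)
    return out

# Hypothetical toy data: two points about 1.1 km apart and one far away.
lat = [37.00, 37.01, 38.00]
lon = [-122.0, -122.0, -122.0]
income = [2.0, 4.0, 10.0]
result = mean_within_radius(lat, lon, income, radius_km=3)
print(result)  # the first two points average each other; the third only sees itself
```

On the real dataframe this would presumably be called as `mean_within_radius(df['latitude'], df['longitude'], df['median_income'], 3)` and assigned to the new column. Each block holds `chunk` times n distances, so `chunk` trades memory for fewer loop iterations. Like the original `value_grabber`, every point counts itself in its own mean.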