Aggregate Pandas Columns on Geospatial Distance

I have a dataframe that has 3 columns, Latitude, Longitude and Median_Income. I need to get the average median income for all points within x km of the original point into a 4th column. I need to do this for each observation.

I have tried writing 3 functions and calling them through apply to do this quickly. However, the dataframe takes forever to process (hours). I haven’t seen an error yet, so it appears to be working okay.

I found the Haversine formula on this site. I am using it to calculate the distance between 2 points from their lat/lon.

    from math import radians, cos, sin, asin, sqrt

    def haversine(lon1, lat1, lon2, lat2):
        # Calculate the great circle distance between two points
        # on the earth (specified in decimal degrees)
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        r = 6371 # Radius of earth in kilometers. Use 3956 for miles
        return c * r

My hav_checker function computes the haversine distance between one row and a given point. Applied over the whole frame, it yields a column holding the distance from the current row to every other row.

    def hav_checker(row, lon, lat):
        hav = haversine(row['longitude'], row['latitude'], lon, lat)
        return hav

My value_grabber function uses the distances computed by hav_checker to return the mean value of my target column (median_income).

For reference, I am using the California housing dataset to build this out.

    def value_grabber(row, frame, threshold, target_col):
        frame = frame.copy()
        frame['hav'] = frame.apply(hav_checker, lon = row['longitude'], lat = row['latitude'], axis=1)
        mean_tar = frame.loc[frame.loc[:,'hav'] <= threshold, target_col].mean()
        return mean_tar

I am trying to return these 3 columns to my original dataframe for a feature engineering project within a larger class project.

    df['MedianIncomeWithin3KM'] = df.apply(value_grabber, frame=df, threshold=3, target_col='median_income', axis=1)

    df['MedianIncomeWithin1KM'] = df.apply(value_grabber, frame=df, threshold=1, target_col='median_income', axis=1)

    df['MedianIncomeWithinHalfKM'] = df.apply(value_grabber, frame=df, threshold=.5, target_col='median_income', axis=1)

I have been able to do this successfully with looping, but it is extremely time intensive and I need a faster solution.
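
One possible direction, sketched under assumptions rather than offered as a definitive answer: the row-by-row apply can be replaced by NumPy broadcasting, computing the haversine distance from a block of rows to all rows at once. The function name mean_within_radius and the chunk_size parameter are hypothetical, and the sketch assumes the target column has no missing values. The work is still quadratic in the number of rows, so for much larger data a spatial index (for example scikit-learn's BallTree with the haversine metric) might be worth a look.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def mean_within_radius(df, radius_km, target_col, lat_col="latitude",
                       lon_col="longitude", chunk_size=1000):
    """Mean of target_col over all rows within radius_km of each row (the row itself included)."""
    lat = np.radians(df[lat_col].to_numpy(dtype=float))
    lon = np.radians(df[lon_col].to_numpy(dtype=float))
    values = df[target_col].to_numpy(dtype=float)
    out = np.empty(len(df))

    # Work in chunks so the distance matrix is chunk_size x n, not n x n.
    for start in range(0, len(df), chunk_size):
        stop = min(start + chunk_size, len(df))
        dlat = lat[None, :] - lat[start:stop, None]
        dlon = lon[None, :] - lon[start:stop, None]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
        dist = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        mask = dist <= radius_km
        # Each row is within distance 0 of itself, so the count is never zero.
        out[start:stop] = (mask * values).sum(axis=1) / mask.sum(axis=1)
    return pd.Series(out, index=df.index)


# Tiny made-up example: two points about 1.1 km apart, one far away.
demo = pd.DataFrame({
    "latitude": [0.0, 0.0, 10.0],
    "longitude": [0.0, 0.01, 10.0],
    "median_income": [1.0, 3.0, 10.0],
})
demo["MedianIncomeWithin3KM"] = mean_within_radius(demo, 3, "median_income")
print(demo)
```

Chunking trades a little speed for memory: a full distance matrix for tens of thousands of rows would need several gigabytes.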

Analyzing patient treatment data using Pandas

I work in the population health industry and get contracts from commercial companies to conduct research on their products. This is the general code to identify target patient groups from provincial datasets, including DAD (hospital discharge), PC (physician claims), NACRS (emergency room visit), PIN (drug dispensation), and REG (provincial registry). The same patient can have multiple rows in each of the databases. For example, if a patient was hospitalized 3 times, s/he will show up as three separate rows in the DAD data. The code does the following:

  1. Import data from csv files into individual Pandas dataframes (df’s)
  2. Go through some initial data cleaning and processing, such as random sampling, date formatting, and loading additional reference information (such as the ICD code for the study condition)
  3. Under the section "1) Identify patients for case def'n #1", a series of steps labels (as tags) each of the relevant data and filters based on these tags. Datasets are linked together to see if a particular patient fulfills the diagnostic code requirement.
  4. Information also needs to be aggregated to the unique patient level via the pivot_table function
  5. At the end, the final patient dataframe is saved into a local directory, and analytic results are printed
  6. I also made my own module, feature_tagger, to house some of the more frequently used functions away from this main code

    # Overall steps:
    # 1) Patient definition: Had a ICD code and a procedure code within a time period
    # 2) Output: A list of PHN_ENC of included patients; corresponding index date
        # .. 'CaseDefn1_PatientDict_FINAL.txt'
        # .. 'CaseDefn1_PatientDf_FINAL.csv'
    # 3) Results: Analytic results
    # ----------------------------------------------------------------------------------------------------------

    import pandas as pd
    import numpy as np  # needed for np.min, np.where and np.nan below
    import ctypes  # needed for the final message box (Windows only)
    import datetime
    import random
    import feature_tagger.feature_tagger as ft
    import data_descriptor.data_descriptor as dd
    import data_transformer.data_transformer as dt
    import var_creator.var_creator as vc

    # Unrestrict pandas' output display
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 120)

    # Control panel
    save_file_switch = False # WARNING: will overwrite existing when == True
    df_subsampling_switch = False # WARNING: make sure to turn off for final results
    edge_date_inclusion = True # whether to include the last date in the range of inclusion criteria
    testing_printout_switch = False
    result_printout_switch = True
    done_switch = True
    df_subsampling_n = 15000
    random_seed = 888

    # Instantiate objects
    ft_obj = ft.Tagger()
    dt_obj = dt.Data_Transformer()

    # Import data
    # (paths use escaped backslashes: a raw string cannot end in a single backslash)
    loc = 'office'
    if loc == 'office':
        directory = 'E:\\My_Working_Primary\\Projects\\Data_Analysis\\'
    elif loc == 'home':
        directory = 'C:\\Users\\MyStuff\\Dropbox\\Projects\\Data_Analysis\\'
    else: pass

    refDataDir = '_Data\\RefData\\'
    realDataDir = '_Data\\RealData\\'
    resultDir = '_Results\\'

    file_dad = 'Prepped_DAD_Data.csv'
    file_pc = 'Prepped_PC_Data.csv'
    file_nacrs = 'Prepped_NACRS_Data.csv'
    file_pin = 'Prepped_PIN_Data.csv'
    file_reg = 'Prepped_REG_Data.csv'

    df_dad = pd.read_csv(directory+realDataDir+file_dad, dtype={'PHN_ENC': str}, encoding='utf-8', low_memory=False)
    df_pc = pd.read_csv(directory+realDataDir+file_pc, dtype={'PHN_ENC': str}, encoding='utf-8', low_memory=False)
    df_nacrs = pd.read_csv(directory+realDataDir+file_nacrs, dtype={'PHN_ENC': str}, encoding='utf-8', low_memory=False)
    df_pin = pd.read_csv(directory+realDataDir+file_pin, dtype={'PHN_ENC': str}, encoding='utf-8', low_memory=False)
    df_reg = pd.read_csv(directory+realDataDir+file_reg, dtype={'PHN_ENC': str}, encoding='utf-8', low_memory=False)

    # Create random sampling of df's to run codes faster
    if df_subsampling_switch==True:
        if (df_subsampling_n>len(df_dad))|(df_subsampling_n>len(df_pc))|(df_subsampling_n>len(df_nacrs))|(df_subsampling_n>len(df_pin)):
            print ('Warning: Specified subsample size is larger than the total no. of row of some of the dataset,')
            print ('As a result, resampling with replacement will be done to reach specified subsample size.')
        df_dad = dt_obj.random_n(df_dad, n=df_subsampling_n, on_switch=df_subsampling_switch, random_state=random_seed)
        df_pc = dt_obj.random_n(df_pc, n=df_subsampling_n, on_switch=df_subsampling_switch, random_state=random_seed)
        df_nacrs = dt_obj.random_n(df_nacrs, n=df_subsampling_n, on_switch=df_subsampling_switch, random_state=random_seed)
        df_pin = dt_obj.random_n(df_pin, n=df_subsampling_n, on_switch=df_subsampling_switch, random_state=random_seed)

    # Format variable type
    df_dad['ADMIT_DATE'] = pd.to_datetime(df_dad['ADMIT_DATE'], format='%Y-%m-%d')
    df_dad['DIS_DATE'] = pd.to_datetime(df_dad['DIS_DATE'], format='%Y-%m-%d')
    df_pc['SE_END_DATE'] = pd.to_datetime(df_pc['SE_END_DATE'], format='%Y-%m-%d')
    df_pc['SE_START_DATE'] = pd.to_datetime(df_pc['SE_START_DATE'], format='%Y-%m-%d')
    df_nacrs['ARRIVE_DATE'] = pd.to_datetime(df_nacrs['ARRIVE_DATE'], format='%Y-%m-%d')
    df_pin['DSPN_DATE'] = pd.to_datetime(df_pin['DSPN_DATE'], format='%Y-%m-%d')
    df_reg['PERS_REAP_END_RSN_DATE'] = pd.to_datetime(df_reg['PERS_REAP_END_RSN_DATE'], format='%Y-%m-%d')

    # Import reference codes
    file_rxCode = '_InStudyCodes_ATC&DIN.csv'
    file_icdCode = '_InStudyCodes_DxICD.csv'
    file_serviceCode = '_InStudyCodes_ServiceCode.csv'

    df_rxCode = pd.read_csv(directory+refDataDir+file_rxCode, dtype={'ICD_9': str}, encoding='utf-8', low_memory=False)
    df_icdCode = pd.read_csv(directory+refDataDir+file_icdCode, encoding='utf-8', low_memory=False)
    df_serviceCode = pd.read_csv(directory+refDataDir+file_serviceCode, encoding='utf-8', low_memory=False)

    # Defining study's constant variables
    inclusion_start_date = datetime.datetime(2017, 4, 1, 00, 00, 00)
    inclusion_end_date = datetime.datetime(2018, 3, 31, 23, 59, 59)

    sp_serviceCode_dict = {df_serviceCode['Short_Desc'][0]:df_serviceCode['Health_Service_Code'][0]}
    sp_serviceCode_val = sp_serviceCode_dict['ABC injection']

    sp_dxCode_dict = {'DIABETES_ICD9': df_icdCode['ICD_9'][0], 'DIABETES_ICD10': df_icdCode['ICD_10'][0]}
    sp_dxCode_val_icd9 = sp_dxCode_dict['DIABETES_ICD9']
    sp_dxCode_val_icd10 = sp_dxCode_dict['DIABETES_ICD10']

    # ----------------------------------------------------------------------------------------------------------

    # 1) Identify patients for case def'n #1.
    # Step 1 - Aged between 18 and 100 years old on the index date
    # Step 2 - Had at least 1 recorded ICD diagnostic code based on physician visit (ICD-9-CA=9999 in PC) or
        # hospitalization (ICD-10-CA=G9999 in DAD) during the inclusion period
    # Step 3.1 - Had at least 1 specific procedure code (99.999O) during
        # the inclusion period (Note: earliest ABC injection code date is the Index date)
    # Step 3.2 - Construct index date
    # Step 4 - Registered as a valid Alberta resident for 2 years before the index date and 1 year after the
        # index date (determined from PR)

    # 1.1) Get age at each service, then delete rows with age falling out of 18-100 range
    df_dad_ageTrimmed = df_dad.copy()
    df_dad_ageTrimmed = df_dad_ageTrimmed[(df_dad_ageTrimmed['AGE']>=18) & (df_dad_ageTrimmed['AGE']<=100)]

    df_pc_ageTrimmed = df_pc.copy()
    df_pc_ageTrimmed = df_pc_ageTrimmed[(df_pc_ageTrimmed['AGE']>=18) & (df_pc_ageTrimmed['AGE']<=100)]

    # 1.2) Tag appropriate date within sp range > tag DIABETES code > combine tags
    df_dad_ageTrimmed['DAD_DATE_TAG'] = ft_obj.date_range_tagger(df_dad_ageTrimmed, 'ADMIT_DATE',
        start_date_range=inclusion_start_date, end_date_range=inclusion_end_date, edge_date_inclusion=
        edge_date_inclusion)
    df_dad_ageTrimmed['DAD_ICD_TAG'] = ft_obj.multi_var_cond_tagger(df_dad_ageTrimmed, repeat_var_base_name='DXCODE',
        repeat_var_start=1, repeat_var_end=25, cond_list=[sp_dxCode_val_icd10])
    df_dad_ageTrimmed['DAD_DATE_ICD_TAG'] = ft_obj.summing_all_tagger(df_dad_ageTrimmed, tag_var_list=['DAD_DATE_TAG',
        'DAD_ICD_TAG'])

    df_pc_ageTrimmed['PC_DATE_TAG'] = ft_obj.date_range_tagger(df_pc_ageTrimmed, 'SE_END_DATE',
        start_date_range=inclusion_start_date, end_date_range=inclusion_end_date, edge_date_inclusion=
        edge_date_inclusion)
    df_pc_ageTrimmed['PC_ICD_TAG'] = ft_obj.multi_var_cond_tagger(df_pc_ageTrimmed, repeat_var_base_name='HLTH_DX_ICD9X_CODE_',
        repeat_var_start=1, repeat_var_end=3, cond_list=[str(sp_dxCode_val_icd9)])
    df_pc_ageTrimmed['PC_DATE_ICD_TAG'] = ft_obj.summing_all_tagger(df_pc_ageTrimmed, tag_var_list=['PC_DATE_TAG',
        'PC_ICD_TAG'])

    # Output a list of all patients PHN_ENC who satisfy the Date and DIABETES code criteria
    df_dad_ageDateICDtrimmed = df_dad_ageTrimmed[df_dad_ageTrimmed['DAD_DATE_ICD_TAG']==1]
    df_pc_ageDateICDtrimmed = df_pc_ageTrimmed[df_pc_ageTrimmed['PC_DATE_ICD_TAG']==1]

    dad_patientList_diabetes_Code = df_dad_ageDateICDtrimmed['PHN_ENC'].unique().tolist()
    pc_patientList_diabetes_Code = df_pc_ageDateICDtrimmed['PHN_ENC'].unique().tolist()
    dad_pc_patientList_diabetes_Code = list(set(dad_patientList_diabetes_Code)|set(pc_patientList_diabetes_Code))
    dad_pc_patientList_diabetes_Code.sort()

    # 1.3.1) Tag appropriate date within sp range > tag ABC injection code > combine tags
    df_pc_ageTrimmed['PC_PROC_TAG'] = df_pc_ageTrimmed['ABC_INJECT']
    df_pc_ageTrimmed['PC_DATE_PROC_TAG'] = ft_obj.summing_all_tagger(df_pc_ageTrimmed, tag_var_list=['PC_DATE_TAG',
        'PC_PROC_TAG'])
    df_pc_ageDateProcTrimmed = df_pc_ageTrimmed[df_pc_ageTrimmed['PC_DATE_PROC_TAG']==1]

    pc_patientList_procCode = df_pc_ageDateProcTrimmed['PHN_ENC'].unique().tolist()
    dad_pc_patientList_diabetes_NprocCode = list(set(dad_pc_patientList_diabetes_Code)&set(pc_patientList_procCode))
    dad_pc_patientList_diabetes_NprocCode.sort()

    # 1.3.2) Find Index date
    df_pc_ageDateProcTrimmed_pivot = pd.pivot_table(df_pc_ageDateProcTrimmed, index=['PHN_ENC'],
        values=['SE_END_DATE', 'AGE', 'SEX', 'RURAL'], aggfunc={'SE_END_DATE':np.min, 'AGE':np.min,
        'SEX':'first', 'RURAL':'first'})
    df_pc_ageDateProcTrimmed_pivot = pd.DataFrame(df_pc_ageDateProcTrimmed_pivot.to_records())
    df_pc_ageDateProcTrimmed_pivot = df_pc_ageDateProcTrimmed_pivot.rename(columns={'SE_END_DATE':'INDEX_DT'})

    # 1.4) Filter by valid registry
    # Create a list variable (based on index date) to indicate which fiscal years need to be valid according to
        # the required 2 years before index and 1 year after index date, in df_pc_ageDateProcTrimmed_pivot
    def extract_needed_fiscal_years(row): # extract 2 years before and 1 year after index date
        if int(row['INDEX_DT'].month) >= 4:
            index_yr = int(row['INDEX_DT'].year)+1
        else:
            index_yr = int(row['INDEX_DT'].year)
        first_yr = index_yr-2
        four_yrs_str = str(first_yr)+','+str(first_yr+1)+','+str(first_yr+2)+','+str(first_yr+3)
        return four_yrs_str

    df_temp = df_pc_ageDateProcTrimmed_pivot.copy()
    df_temp['FYE_NEEDED'] = df_temp.apply(extract_needed_fiscal_years, axis=1)
    df_temp['FYE_NEEDED'] = df_temp['FYE_NEEDED'].apply(lambda x: x[0:].split(',')) # from whole string to list of string items
    df_temp['FYE_NEEDED'] = df_temp['FYE_NEEDED'].apply(lambda x: [int(i) for i in x]) # from list of string items to list of int items

    # Create a list variable to indicate the active fiscal year, in df_reg
    df_reg['FYE_ACTIVE'] = np.where(df_reg['ACTIVE_COVERAGE']==1, df_reg['FYE'], np.nan)
    df_reg_agg = df_reg.groupby(by='PHN_ENC').agg({'FYE_ACTIVE':lambda x: list(x)})
    df_reg_agg = df_reg_agg.reset_index()
    df_reg_agg['FYE_ACTIVE'] = df_reg_agg['FYE_ACTIVE'].apply(lambda x: [i for i in x if ~np.isnan(i)]) # remove float nan
    df_reg_agg['FYE_ACTIVE'] = df_reg_agg['FYE_ACTIVE'].apply(lambda x: [int(i) for i in x]) # convert float to int

    # Merge df's and create tag, if active years do not cover all the required fiscal year, exclude patients
    # Create inclusion/exclusion patient list to apply to obtain patient cohort based on case def'n #1
    df_temp_v2 = df_temp.merge(df_reg_agg, on='PHN_ENC', how='left')
    df_temp_v2_trimmed = df_temp_v2[(df_temp_v2['FYE_NEEDED'].notnull())&(df_temp_v2['FYE_ACTIVE'].notnull())] # Remove rows with missing on either variables

    def compare_list_elements_btw_cols(row):
        if set(row['FYE_NEEDED']).issubset(row['FYE_ACTIVE']):
            return 1
        else:
            return 0

    df_temp_v2_trimmed['VALID_REG'] = df_temp_v2_trimmed.apply(compare_list_elements_btw_cols, axis=1)
    df_temp_v2_trimmed_v2 = df_temp_v2_trimmed[df_temp_v2_trimmed['VALID_REG']==1]
    reg_patientList = df_temp_v2_trimmed_v2['PHN_ENC'].unique().tolist()

    # Apply inclusion/exclusion patient list (from REG) to find final patients
    # Obtain final patient list
    df_final_defn1 = df_pc_ageDateProcTrimmed_pivot.merge(df_temp_v2_trimmed_v2, on='PHN_ENC', how='inner')
    df_final_defn1 = df_final_defn1[['PHN_ENC', 'AGE_x', 'SEX_x', 'RURAL_x', 'INDEX_DT_x']]
    df_final_defn1 = df_final_defn1.rename(columns={'AGE_x':'AGE', 'SEX_x':'SEX', 'RURAL_x':'RURAL', 'INDEX_DT_x':'INDEX_DT',})
    df_final_defn1['PREINDEX_1Yr'] = (df_final_defn1['INDEX_DT']-pd.Timedelta(days=364)) # 364 because index date is counted as one pre-index date
    df_final_defn1['PREINDEX_2Yr'] = (df_final_defn1['INDEX_DT']-pd.Timedelta(days=729)) # 729 because index date is counted as one pre-index date
    df_final_defn1['POSTINDEX_1Yr'] = (df_final_defn1['INDEX_DT']+pd.Timedelta(days=364))

    list_final_defn1 = df_final_defn1['PHN_ENC'].unique().tolist()
    dict_final_defn1 = {'Final unique patients of case definition #1':list_final_defn1}

    # Additional ask (later on)
    # How: Create INDEX_DT_FIS_YR (index date fiscal year) by mapping INDEX_DT to fiscal year
    def index_date_fiscal_year(row):
        if ((row['INDEX_DT'] >= datetime.datetime(2015, 4, 1, 00, 00, 00)) &
            (row['INDEX_DT'] < datetime.datetime(2016, 4, 1, 00, 00, 00))):
            return '2015/2016'
        elif ((row['INDEX_DT'] >= datetime.datetime(2016, 4, 1, 00, 00, 00)) &
            (row['INDEX_DT'] < datetime.datetime(2017, 4, 1, 00, 00, 00))):
            return '2016/2017'
        else:
            return 'Potential error'

    df_final_defn1['INDEX_DT_FIS_YR'] = df_final_defn1.apply(index_date_fiscal_year, axis=1)

    # 2) Output final patient list for future access
    # WARNING: will overwrite existing
    if save_file_switch == True:
        if df_subsampling_switch == True:
            f = open(directory+resultDir+'_CaseDefn1_PatientDict_Subsample.txt',"w")
            f.write(str(dict_final_defn1)+',')
            f.close()
            df_final_defn1.to_csv(directory+resultDir+'_CaseDefn1_PatientDf_Subsample.csv', sep=',', encoding='utf-8')
        elif df_subsampling_switch == False:
            f = open(directory+resultDir+'CaseDefn1_PatientDict_FINAL.txt',"w")
            f.write(str(dict_final_defn1)+',')
            f.close()
            df_final_defn1.to_csv(directory+resultDir+'CaseDefn1_PatientDf_FINAL.csv', sep=',', encoding='utf-8')

    # 3) Results: Analytic results
    if result_printout_switch == True:
        print ('Unique PHN_ENC N, (aged 18 to 100 during inclusion period) from DAD:')
        print (df_dad_ageTrimmed['PHN_ENC'].nunique())

        print ('Unique PHN_ENC N, (aged 18 to 100 during inclusion period) from PC:')
        print (df_pc_ageTrimmed['PHN_ENC'].nunique())

        print ('Unique PHN_ENC N, (aged 18 to 100 during inclusion period) from DAD or PC:')
        dd_obj = dd.Data_Comparator(df_dad_ageTrimmed, df_pc_ageTrimmed, 'PHN_ENC')
        print (dd_obj.unique_n_union())

        print ('Unique PHN_ENC N, (aged 18 to 100) and (had DIABETES code during inclusion period) from DAD:')
        print (df_dad_ageDateICDtrimmed['PHN_ENC'].nunique())

        print ('Unique PHN_ENC N, (aged 18 to 100) and (had DIABETES code during inclusion period) from PC:')
        print (df_pc_ageDateICDtrimmed['PHN_ENC'].nunique())

        print ('Unique PHN_ENC N, (aged 18 to 100) and (had DIABETES code during inclusion period) from DAD or PC:')
        print (len(dad_pc_patientList_diabetes_Code))

        print ('Unique PHN_ENC N, (aged 18 to 100) and (had DIABETES code during inclusion period)'
               ' and (had ABC injection code) from DAD and PC:')
        print (df_pc_ageDateProcTrimmed_pivot['PHN_ENC'].nunique())

        print ('Unique PHN_ENC N, (aged 18 to 100) and (had DIABETES code during inclusion period)'
               ' and (had ABC injection code) and (had AB resident around index date) from DAD, PC, and REG [Case Def #1]:')
        print (df_final_defn1['PHN_ENC'].nunique())

        # Additional analytic ask (later on)
        print ('Patient N by index date as corresponding fiscal year:')
        print (df_final_defn1['INDEX_DT_FIS_YR'].value_counts())

    if done_switch == True:
        ctypes.windll.user32.MessageBoxA(0, b'Hello there', b'Program done.', 3)

My questions are:

  • This is code for a specific project. While other projects from other companies are not exactly the same, they usually follow similar overall steps, including cleaning data, linking data, creating tags, filtering tags, aggregating data, saving files, and producing results. How can I refactor my code to be maintainable within this specific project, as well as reusable across similar projects?
  • Many times, once I have run the code and produced the results, clients come back to ask for additional follow-up information (i.e., the items under # Additional ask (later on)). How can I deal with additional asks more effectively, with maintainability and expandability in mind?
  • Are there any areas where I could try using design patterns?
  • Any other suggestions on how I can write better python code are more than welcome.
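
One small piece of the script that might be worth pulling into a reusable helper is the fiscal-year mapping: index_date_fiscal_year hard-codes two fiscal years and falls through to 'Potential error' for anything else. A minimal sketch of a vectorized alternative follows, assuming fiscal years run from April 1 to March 31 as in the script and that the dates contain no missing values. The names fiscal_year_label and start_month are hypothetical and not part of the original code.

```python
import pandas as pd


def fiscal_year_label(dates: pd.Series, start_month: int = 4) -> pd.Series:
    """Map each date to a label such as '2015/2016' for a fiscal year starting in start_month."""
    # Dates before the start month belong to the fiscal year that began the previous calendar year.
    start_year = dates.dt.year - (dates.dt.month < start_month).astype(int)
    return start_year.astype(str) + "/" + (start_year + 1).astype(str)


# Made-up index dates for illustration only.
index_dates = pd.Series(pd.to_datetime(["2015-04-01", "2016-03-31", "2017-01-15"]))
labels = fiscal_year_label(index_dates)
print(labels.tolist())
```

Because the helper takes a Series rather than a row, it could be called as fiscal_year_label(df_final_defn1['INDEX_DT']) without apply, and a later ask covering other years would not require editing the function.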

Python: Combining Two Rows with Pandas read_excel

I am reading an Excel file using Pandas and I feel like there has to be a better way to handle the way I create column names. This is something like the Excel file I’m reading:

                    1       2      # '1' is merged in the two cells above 'a' and 'b'
        Date        a   b   c   d  # likewise for '2'. As opposed to 'centered across selection'
    1   1-Jan-19    100 200 300 400
    2   1-Feb-19    101 201 301 401
    3   1-Mar-19    102 202 302 402

I want to merge the ‘a’, ‘b’, ‘c’, and ‘d’ column heads with the ‘1’ and ‘2’ above them, so I’m doing the following to get my headers the way that I want:

    import pandas as pd
    import json

    xls = pd.ExcelFile(r'C:\Path_to\Excel_Pandas_Connector_Test.xls')
    df = pd.read_excel(xls, 'Sheet1', header=[1])  # uses the abcd row as column names

    #  I only want the most recent day of data so I do the following
    json_str = df[df.Date == df['Date'].max()].to_json(orient='records',date_format='iso')

    dat_data = json.loads(json_str)[0]

    def clean_json():
        global dat_data
        dat_data['1a']      = dat_data.pop('a')
        dat_data['1b']      = dat_data.pop('b')
        dat_data['2c']      = dat_data.pop('c')
        dat_data['2d']      = dat_data.pop('d')

    clean_json()

    print(json.dumps(dat_data,indent=4))

My desired output is:

    {
        "Date": "2019-03-01T00:00:00.000Z",
        "1a": 102,
        "1b": 202,
        "2c": 302,
        "2d": 402
    }

This works as written, but is there a Pandas built-in that I could have used to do the same thing instead of the clean_json function?
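
One approach that might replace clean_json is to read both header rows (header=[0, 1] in read_excel), which gives two-level column labels, and then join the levels. The sketch below builds such a frame in memory instead of reading the Excel file, so it can be run anywhere. The "Unnamed: 0_level_0" label for the Date column is an assumption about what pandas produces for an empty top-level cell, and flatten is a hypothetical helper name.

```python
import json

import pandas as pd

# Stand-in for: df = pd.read_excel(xls, 'Sheet1', header=[0, 1])
columns = pd.MultiIndex.from_tuples(
    [("Unnamed: 0_level_0", "Date"), ("1", "a"), ("1", "b"), ("2", "c"), ("2", "d")]
)
df = pd.DataFrame(
    [[pd.Timestamp("2019-01-01"), 100, 200, 300, 400],
     [pd.Timestamp("2019-02-01"), 101, 201, 301, 401],
     [pd.Timestamp("2019-03-01"), 102, 202, 302, 402]],
    columns=columns,
)


def flatten(col):
    """Join the two header levels, dropping placeholder labels for empty top cells."""
    top, bottom = (str(part) for part in col)
    return bottom if top.startswith("Unnamed") else top + bottom


df.columns = [flatten(col) for col in df.columns]

# Keep only the most recent day, as in the question.
latest = df[df["Date"] == df["Date"].max()]
dat_data = json.loads(latest.to_json(orient="records", date_format="iso"))[0]
print(json.dumps(dat_data, indent=4))
```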

Column with leading zeros in pandas

Good day. I have just imported a CSV without delimiters using pandas read_fwf, splitting each column according to its width.

My problem is that one of the resulting columns has dtype object and holds a date in dd/mm/yyyy form. Specifically, if the value is 03042019, pandas ignores the leading zero and I end up with 3042019. What should I do to keep this from happening? I tried zerofill, but I would like to know whether there is another solution. Thanks in advance.
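
A possible alternative to padding the zeros back afterwards is to tell read_fwf to keep the column as text, so the value is never parsed as a number. The sketch below uses an in-memory sample instead of the real file. The column names fecha and codigo, the widths, and the sample rows are assumptions made for illustration.

```python
import io

import pandas as pd

# Made-up fixed-width sample: 8 characters of date (ddmmyyyy) followed by a 3-character code.
raw = "03042019ABC\n15112019XYZ\n"

df = pd.read_fwf(
    io.StringIO(raw),
    widths=[8, 3],
    header=None,
    names=["fecha", "codigo"],
    dtype=str,  # read every column as text so leading zeros survive
)

# With the zeros intact, the text can be parsed into real dates if needed.
df["fecha_dt"] = pd.to_datetime(df["fecha"], format="%d%m%Y")
print(df)
```

If only one column should stay as text, passing a dict such as dtype={"fecha": str} should have the same effect for that column alone.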

Fast averaging over Pandas dataframe subsets

I’m trying to loop over a large number of trials and compute a weighted average for a number of subsets. Currently the data is in long format with columns trial, area, and score.

      trial  area       score
    0  T106     0   0.0035435
    1  T106     1   0.0015967
    2  T106     4   0.0003191
    3  T106     4   0.1272919
    4  T288     0   0.1272883

I have about 120,000 trials, with 4 areas and maybe 10 to 100 scores per trial, for a total of ~7 million rows. My first thought was to loop over all trials within a loop over the 4 areas, build a temp dataframe to compute the scores, and add the scores to an external dataframe:

    for area in range(3):
        for trial in trial_names.iloc[:,0]:
            Tscore = 0
            temp_trial = pd.DataFrame(trials_long.loc[(trials_long['tname'] == trial) & (trials_long['area'] == int(area))])
            # match score in trial
            temp_trial = temp_trial.merge(scores_df, how='left')
            # sum score for all matching 'trial' + 'area'
            # this will be a weighted average, with >0.5 *2 and >0.9 *3
            temp_trial.loc[temp_trial['score'] > 0.9, ['score']] *= 3        # weight 3x for >0.9
            temp_trial.loc[temp_trial['score'] > 0.5, ['score']] *= 2        # weight 2x for >0.5
            Tscore = temp_trial['score'].sum() / int(len(temp_trial.index))
            trial_names.loc[trial,area] = Tscore                    # store Tscore somewhere
            Tscore = 0
        print('done')

Time is really of the essence in this case, and the computations need to finish in under 15 seconds or so. In R I’d normally use a number of vectorized functions to skip the loops, and any loops I did have would be parallelized over multiple cores. I would also be open to learning something new, perhaps hash maps?
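
A vectorized direction that might fit here is to compute the weights for every row in one pass and let groupby do the per-trial, per-area averaging. The sketch below uses the five sample rows shown above and assumes the intended weights are 3 for scores above 0.9, 2 for scores above 0.5, and 1 otherwise. Note that the loop above multiplies scores above 0.9 by both 3 and 2, so the numbers differ if that was intentional. It also assumes the scores are already a column of the long table rather than merged in from scores_df.

```python
import numpy as np
import pandas as pd

trials_long = pd.DataFrame({
    "trial": ["T106", "T106", "T106", "T106", "T288"],
    "area": [0, 1, 4, 4, 0],
    "score": [0.0035435, 0.0015967, 0.0003191, 0.1272919, 0.1272883],
})

# First matching condition wins, so > 0.9 gets 3 and the remaining > 0.5 get 2.
weights = np.select(
    [trials_long["score"] > 0.9, trials_long["score"] > 0.5],
    [3, 2],
    default=1,
)

weighted = trials_long.assign(weighted_score=trials_long["score"] * weights)

# One row per trial, one column per area; combinations with no data become NaN.
result = weighted.groupby(["trial", "area"])["weighted_score"].mean().unstack("area")
print(result)
```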


How to make two nested for loops more efficient in pandas

I have two different series in pandas, and I have written a nested for loop that checks whether each value of the first series occurs in the values of the other series. This is time consuming, and I cannot work out how to change it into a pandas method. I thought of using the apply function, but it did not work with method chaining. My original nested for loops look like this, and they work:

    for x in df_one['ser_one']:
        print(x)
        for y in df_two['ser_two']:
            if 'MBTS' not in y and x in y:
                print(y)

Is there a way to make this less time consuming?

Here is what I attempted using apply:

df_two['ser_two'].apply(lambda x: x if 'MBTS' not in df_one['ser_one'].apply(lambda y:y) and x in df_one['ser_one'].apply(lambda y:y)) 
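
A possible way to cut the work down, sketched with made-up sample values: drop the 'MBTS' rows once up front, then let the vectorized str.contains replace the inner loop. The outer loop over ser_one remains, so this is only a partial vectorization. The sample strings and the matches dictionary are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frames.
df_one = pd.DataFrame({"ser_one": ["abc", "xyz"]})
df_two = pd.DataFrame({"ser_two": ["abc-1", "MBTS-abc", "00xyz", "other"]})

# Filter out the 'MBTS' rows a single time instead of re-checking them for every x.
candidates = df_two.loc[
    ~df_two["ser_two"].str.contains("MBTS", regex=False), "ser_two"
]

# For each value of ser_one, collect the candidates that contain it as a substring.
matches = {
    x: candidates[candidates.str.contains(x, regex=False)].tolist()
    for x in df_one["ser_one"]
}
print(matches)
```

Passing regex=False treats each value as a literal substring, which mirrors the `x in y` test in the original loop.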

Pandas how to avoid apply in groupby nlargest n

It is generally recommended to avoid apply in pandas. I have a situation here where I would like to know whether there are more efficient alternatives to apply.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'year': [1990,1990,1990,1992,1992,1992,1992,1993,1993,1993],
                       'item': list('abcdefghij'),
                       'value': [100,200,300,400,500,600,700,800,900,990]})
    df

I would like to get top 2 values for each year.

df.groupby('year')['value'].apply(lambda x: x.nlargest(2)).reset_index() 

Is there any alternative to this? Anything is welcome, even if it takes more lines of code!
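
Two apply-free variants that might be worth timing against the original, shown on the same sample frame: calling nlargest directly on the grouped column, and sorting once before taking the first two rows of each group. Which one is faster on real data is not something this sketch establishes.

```python
import pandas as pd

df = pd.DataFrame({'year': [1990, 1990, 1990, 1992, 1992, 1992, 1992, 1993, 1993, 1993],
                   'item': list('abcdefghij'),
                   'value': [100, 200, 300, 400, 500, 600, 700, 800, 900, 990]})

# Variant 1: nlargest is available on the grouped Series, so no lambda is needed.
top2_values = df.groupby('year')['value'].nlargest(2).reset_index()

# Variant 2: sort once, then keep the first two rows per year (this keeps the 'item' column too).
top2_rows = (
    df.sort_values(['year', 'value'], ascending=[True, False])
      .groupby('year')
      .head(2)
)

print(top2_values)
print(top2_rows)
```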

pandas – adding values to a new column based on iteration of another column – what is the most elegant form of code?

I have the following dataframe:

    registration_datetime    0       day_of_week
    2013-01-05               793     5
    2013-01-07               954     0
    2013-01-11               989     4
    2013-01-12               791     5
    2013-01-13               1635    6

First of all, I’ve been struggling to change the name of the 0 column for some reason. Any clues?

Second, and more important: the second column (aka ‘0’) holds the value of some measurement on each day. I want to write code that, for each day (for example Tuesday), checks whether the measurement has increased or decreased by 20% since the previous Tuesday. The answer should be recorded as a corresponding value from [-1,0,1] in a new column.

This of course means that the code will start iterating from the 8th day.

So, for example, in the data I’ve shown, there should be a new column:

‘drastic change’

(because ((1005-1257)/1257)*100 = -20 --> so the value should be) -1

(because ((1112-1472)/1473)*100 = -24 --> so the value should be) -1

(because ((1270-1249)/1249)*100 = 1.6 --> so the value should be) 0

(because ((989-1094)/1094)*100 = -9 --> so the value should be) 0

(because ((791-793)/793)*100 = 0 --> so the value should be) 0

(because ((1635-1493)/1493)*100 = 9.5 --> so the value should be) 0

(because ((1620-954)/954)*100 = 70 --> so the value should be) 1

(because ((1260-1005)/1005)*100 = 25 --> so the value should be) 1

So, ignoring my explanations in the brackets, the new column should show the values [-1,-1,0,0,0,0,1,1,…]

I am asking: what is the most elegant way to do this? I suppose that, this being a data science package, there might be automatic methods for this sort of thing? If not, I am still hoping someone can suggest some code that I could play with.

Thanks a lot!
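
A minimal sketch of one way this might be done without an explicit loop, using made-up numbers rather than the real measurements. It assumes the column label is the integer 0 rather than the string '0', which is a common reason a rename appears to do nothing, and that a change of exactly 20% counts as drastic. Rows with no earlier same-weekday value (the first week) get 0 here.

```python
import numpy as np
import pandas as pd

# Two weeks of hypothetical daily data; the measurement column is labelled with the integer 0.
dates = pd.date_range("2013-01-01", periods=14, freq="D")
df = pd.DataFrame({
    "registration_datetime": dates,
    0: [1000, 1000, 1000, 1000, 1000, 1000, 1000,
        1300, 1000, 700, 1100, 900, 1250, 1000],
})
df["day_of_week"] = df["registration_datetime"].dt.dayofweek

# Rename using the integer key, not the string '0'.
df = df.rename(columns={0: "value"})

# Previous value for the same weekday, i.e. last Tuesday for a Tuesday.
previous = df.groupby("day_of_week")["value"].shift(1)
change = (df["value"] - previous) / previous

# Comparisons against NaN are False, so the first week falls through to 0.
df["drastic_change"] = np.select([change >= 0.2, change <= -0.2], [1, -1], default=0)
print(df)
```

Grouping by day_of_week rather than shifting by 7 rows means missing days do not misalign the comparison, although a skipped week would then be compared against the value from two weeks earlier.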

Help with similarity search in Pandas 0.23.4 (Python-3.6.5)

Good morning.

I have searched but cannot find out how to do the following:

I need to search a dataframe for the phrase cas amarilla. The dataframe would have a column with values such as:

    Apartamento sexto piso
    Finca avícola
    Finca agrícola
    Garaje
    Casas
    Casa dos plantas
    Casa tres plantas

The search result should give me:

    Casas
    Casa dos plantas
    Casa tres plantas

For cas amarilla, I need it to show all the values where there was a match on the letters cas. It should not matter if, for example, the user searches for caza with a z; the search should then show everything that contains ca.

Being taught how to do this would already be a lot, but if you could also explain how to show that result in a pop-up window in PyQt5, let the user select one of the results, and save that result in a new dataframe, you would help me enormously.

Thanks!
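
A rough sketch of the search part only (the PyQt5 pop-up is not covered), under one reading of the requirement: take the first word of the query and shorten it one letter at a time until some row contains it, ignoring case. The column name descripcion, the function name search_by_prefix, and the min_len parameter are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"descripcion": [
    "Apartamento sexto piso",
    "Finca avícola",
    "Finca agrícola",
    "Garaje",
    "Casas",
    "Casa dos plantas",
    "Casa tres plantas",
]})


def search_by_prefix(frame, column, query, min_len=2):
    """Return rows whose text contains the longest matching prefix of the query's first word."""
    words = query.split()
    if not words:
        return frame.iloc[0:0]  # empty query: empty result
    word = words[0].lower()
    # Try the whole word first, then progressively shorter prefixes.
    for length in range(len(word), min_len - 1, -1):
        mask = frame[column].str.contains(word[:length], case=False, regex=False)
        if mask.any():
            return frame[mask]
    return frame.iloc[0:0]


print(search_by_prefix(df, "descripcion", "cas amarilla"))
print(search_by_prefix(df, "descripcion", "caza"))
```

With this reading, caza falls back to ca and therefore also returns the two Finca rows, since "Finca" contains "ca". The selected row could then be copied into a new dataframe with ordinary indexing once the user has picked one.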