How to reverse the order of the columns of a DataFrame with Python

Hi, I would like to know how I can reverse the order of whole columns using Python.

FRUTA   |   VITAMINA   |   PREÇO
LARANJA |      C       |   2.00
MAÇÃ    |      B1      |   2.00
BANANA  |      B2      |   1.00

I would like to know how I can transform the table above into this one:

PREÇO   |   VITAMINA   |   FRUTA
2.00    |      C       |   LARANJA
2.00    |      B1      |   MAÇÃ
1.00    |      B2      |   BANANA

I just want to move each whole column with all of its values. How can I do that? Thanks =)
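One way this is commonly done in pandas (a minimal sketch built from the sample table above; the DataFrame construction is only for illustration):

import pandas as pd

df = pd.DataFrame({
    'FRUTA': ['LARANJA', 'MAÇÃ', 'BANANA'],
    'VITAMINA': ['C', 'B1', 'B2'],
    'PREÇO': [2.00, 2.00, 1.00],
})

# Reindex with the column labels in reverse order; every column keeps all of its values.
reversed_df = df[df.columns[::-1]]
# Equivalent positional form: df.iloc[:, ::-1]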

Select rows from a dataframe whose values in a specific column appear in the same column of another dataframe

I have df1 which has:

c1    c2   c3   c4  c5
42    32   75   23  wq
63  2332  343  232  tr
87   675  305    2  iu
15    82   10   33  tr

And df2 which has:

c1   c2   c3   c4  c5
87   42   97   53  fe
74   22   98  223  wq
87  675  784  321  iu
31   92   84  865  wq

I want to return the rows of df2 whose values in c5 also exist in c5 of df1.

So, as a result, the output would look like this:

result

c1   c2   c3   c4  c5
74   22   98  223  wq
87  675  784  321  iu
31   92   84  865  wq
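A minimal sketch of one common approach, assuming df1 and df2 are built from the tables above: Series.isin produces a boolean mask that keeps exactly the rows shown in the expected result.

import pandas as pd

# Keep the rows of df2 whose c5 value also appears somewhere in df1['c5'].
result = df2[df2['c5'].isin(df1['c5'])]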

Fastest way to find dataframe indexes of column elements that exist as lists

I asked this question here: https://stackoverflow.com/q/55640147/5202255 and was told to post on this forum. I would like to know whether my solution can be improved or if there is another approach to the problem. Any help is really appreciated!

I have a pandas dataframe in which the column values exist as lists. Each list has several elements and one element can exist in several rows. An example dataframe is:

X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])

X =
   A          B
0  1  [a, b, c]
1  2     [a, b]
2  3     [c, d]

I want to find all the rows, i.e. dataframe indexes, corresponding to each element in the lists, and create a dictionary out of it. Disregard column A here, as column B is the one of interest! So element 'a' occurs at indexes 0 and 1, which gives {'a': [0, 1]}. The solution for this example dataframe is:

Y = {'a':[0,1],'b':[0,1],'c':[0,2],'d':[2]} 

I have written code that works fine, and I can get a result. My problem is more to do with the speed of computation. My actual dataframe has about 350,000 rows, and the lists in column 'B' can contain up to 1,000 elements. At present the code runs for several hours! I was wondering whether my solution is very inefficient. Any help with a faster, more efficient way will be really appreciated! Here is my solution code:

import itertools
import pandas as pd

X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])

B_dict = []
for idx, val in X.iterrows():
    B = val['B']
    B_dict.append(dict(zip(B, [[idx]]*len(B))))
    B_dict = [{k: list(itertools.chain.from_iterable(list(filter(None.__ne__, [d.get(k) for d in B_dict]))))
               for k in set().union(*B_dict)}]

print('Result:', B_dict[0])

Output

Result: {'d': [2], 'c': [0, 2], 'b': [0, 1], 'a': [0, 1]} 

The code for the final line in the for loop was borrowed from https://stackoverflow.com/questions/45649141/combine-values-of-same-keys-in-a-list-of-dicts and https://stackoverflow.com/questions/16096754/remove-none-value-from-a-list-without-removing-the-0-value.
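One alternative worth trying (a sketch, not a benchmark) is to let pandas flatten the lists instead of iterating row by row: Series.explode (available from pandas 0.25) repeats the dataframe index for every list element, after which a single groupby collects the indexes per element.

import pandas as pd

X = pd.DataFrame([(1, ['a', 'b', 'c']), (2, ['a', 'b']), (3, ['c', 'd'])], columns=['A', 'B'])

# One row per list element, with the original dataframe index repeated,
# then group the index labels by element value and collect them into lists.
Y = (X['B'].explode()
           .reset_index()
           .groupby('B')['index']
           .apply(list)
           .to_dict())

# Y == {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}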

Fast averaging over Pandas dataframe subsets

I'm trying to loop over a large number of trials and compute a weighted average for a number of subsets. Currently the data is in long format with columns trial, area, and score.

  trial  area      score
0  T106     0  0.0035435
1  T106     1  0.0015967
2  T106     4  0.0003191
3  T106     4  0.1272919
4  T288     0  0.1272883

I have about 120,000 trials, with 4 areas and maybe 10 to 100 scores per trial, for a total of ~7 million rows. My first thought was to loop over all trials within a loop over the 4 areas, build a temp dataframe to compute the scores, and add the scores to an external dataframe:

for area in range(3):
    for trial in trial_names.iloc[:, 0]:
        Tscore = 0
        temp_trial = pd.DataFrame(trials_long.loc[(trials_long['tname'] == trial) & (trials_long['area'] == int(area))])
        # match score in trial
        temp_trial = temp_trial.merge(scores_df, how='left')
        # sum score for all matching 'trial' + 'area'
        # this will be a weighted average, with >0.5 *2 and >0.9 *3
        temp_trial.loc[temp_trial['score'] > 0.9, ['score']] *= 3   # weight 3x for >0.9
        temp_trial.loc[temp_trial['score'] > 0.5, ['score']] *= 2   # weight 2x for >0.5
        Tscore = temp_trial['score'].sum() / int(len(temp_trial.index))
        trial_names.loc[trial, area] = Tscore                        # store Tscore somewhere
        Tscore = 0
    print('done')

Time is really of the essence in this case and the computations need to happen in under 15 seconds or so. In R I'd normally use a number of vectorized functions to skip the loops, and any loops I did have would be parallelized over multiple cores. I would also be open to learning something new, perhaps hash maps?

Thanks!
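A vectorized alternative to the nested loops, as a sketch: assume the long table has already been merged with scores_df once up front, and that the column names follow the sample above (trial, area, score; substitute tname if that is the real name). Applying the weights to the whole column and doing a single groupby reproduces the per-trial, per-area sum(score)/count computed in the loop.

import pandas as pd

df = trials_long.copy()  # assumed to already contain trial, area and score after the merge

# Same sequential weighting as in the loop: rows above 0.9 are tripled first,
# then everything above 0.5 (including rows already tripled) is doubled.
df.loc[df['score'] > 0.9, 'score'] *= 3
df.loc[df['score'] > 0.5, 'score'] *= 2

# One grouped mean over the full long table instead of one small DataFrame per trial/area.
result = df.groupby(['trial', 'area'])['score'].mean().unstack('area')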

Slicing and dicing a pandas dataframe

I would like to reshape a data frame object in multiple steps: first removing the first several rows of data, second setting a new index, and lastly choosing which columns to include. The dataframe object is parsed from an Excel file and includes header information that I would like to separate from the data frame values in the lower part of the sheet. Any pointers to libraries and/or specific components of libraries would be appreciated. Links to documentation would be most helpful.
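All three steps are plain pandas indexing; here is a sketch in which the file name and the column names 'id', 'value' and 'date' are placeholders rather than anything from the question. The pandas documentation at https://pandas.pydata.org/docs/ covers all of these operations (indexing and selecting data, set_index, read_excel).

import pandas as pd

# Placeholder file and column names -- substitute the real sheet layout.
df = pd.read_excel('workbook.xlsx', sheet_name=0)

df = df.iloc[10:]               # 1. drop the first several rows (here, the first 10)
df = df.set_index('id')         # 2. set a new index from one of the columns
df = df[['value', 'date']]      # 3. keep only the columns of interest

# read_excel can also do part of this at load time, e.g. skiprows=10 and usecols=[...].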

Replace values in one column depending on the values in several other columns of a dataframe

Original data frame:

[image: the original data frame]

I need to conditionally replace values in the SKU column: if SKU == 207041 and warehouse == MSC and client == Тандер, then SKU = 916041.

Expected result:

[image: the expected data frame]
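A minimal sketch, assuming the frame has columns named SKU, warehouse and client as described: build a boolean mask from the three conditions and assign through .loc.

import pandas as pd

mask = (df['SKU'] == 207041) & (df['warehouse'] == 'MSC') & (df['client'] == 'Тандер')
df.loc[mask, 'SKU'] = 916041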

Sharing Pandas dataframe between processes using multiprocessing in Python

I have a minimal example of multiprocessing where the expected output is a shared pandas dataframe. However, it seems that the four parallel processes stop before updating the dataframe during their first task, so updating the shared dataframe seems to be killing the processes. In my example, 10 text files are first created for testing purposes that each contain a single integer corresponding to the file name. The “analyze_file” function is given each of the 10 file paths and the namespace for variable sharing, and then it enters “result” (the sum of the integer value given in the files and each of the constants in the list called “constants”) into the appropriate place in the dataframe. I am attempting to use the namespace method for sharing the dataframe, but I must be incorrectly using it.

Any ideas about getting the dataframe to be updated after each task, and get variable sharing to work? Am I making a simple mistake? I am trying to follow the method given here: https://stackoverflow.com/questions/19887087/how-to-share-pandas-dataframe-object-between-processes

from multiprocessing import Manager
import multiprocessing as mp
import pandas as pd
import os

test_folder = r'C:\test_files'
test_filenames = ['one', 'two', 'three', 'four', 'five',
                  'six', 'seven', 'eight', 'nine', 'ten']
constants = [10, 15, 30, 60, 1440]

ct = 1
for filename in test_filenames:
    with open(test_folder + '\\' + filename + '.txt', 'w') as f:
        f.write(str(ct))
    ct += 1

def analyze_file(file_path, ns):
    with open(file_path) as f:
        value = int(f.readline())

    filename = file_path.split('\\')[-1]
    for constant in constants:
        result = value + constant
        ns.df.at[constant, filename] = result

def worker_function(file_paths, ns):
    for file_path in file_paths:
        analyze_file(file_path, ns)

def run_parallel(file_paths, number_procs, ns):
    procs = []
    for i in range(number_procs):
        paths_load = file_paths[i::number_procs]
        proc = mp.Process(target=worker_function, args=(paths_load, ns))
        procs.append(proc)
        procs[i].start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    num_procs = 4
    files = os.listdir(test_folder)
    file_paths = [test_folder + '\\' + file for file in files]
    output_df = pd.DataFrame(columns=files, index=constants)

    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = output_df

    run_parallel(file_paths, num_procs, ns)

    output_df = ns.df
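One likely culprit is that attribute access on a Manager Namespace hands each process a pickled copy of the stored object, so ns.df.at[...] = ... mutates a local copy that is never written back to the manager. An alternative that avoids shared state altogether is to have the workers return their results and assemble the frame in the parent, for example with a Pool. This is a sketch of that idea, not a drop-in replacement for the code above:

from multiprocessing import Pool
import os
import pandas as pd

constants = [10, 15, 30, 60, 1440]

def analyze_file(file_path):
    # Return the per-file results instead of writing into a shared DataFrame.
    with open(file_path) as f:
        value = int(f.readline())
    filename = os.path.basename(file_path)
    return filename, {constant: value + constant for constant in constants}

if __name__ == '__main__':
    test_folder = r'C:\test_files'
    file_paths = [os.path.join(test_folder, name) for name in os.listdir(test_folder)]

    with Pool(processes=4) as pool:
        results = pool.map(analyze_file, file_paths)

    # Build the DataFrame once, in the parent, from the returned pieces.
    output_df = pd.DataFrame({name: pd.Series(res) for name, res in results})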

AnalysisException is thrown when the DataFrame is empty (No such struct field)

I have a dataframe on which I apply a filter and then a series of transformations. At the end, I select several columns.

// Filters the events related to a user_principal.
var filteredCount = events.filter("Properties.EventTypeName == 'user_principal_created' or Properties.EventTypeName == 'user_principal_updated'")
                          // Selects the columns based on the event type.
                          .withColumn("Username", when(col("Properties.EventTypeName") === lit("user_principal_created"), col("Body.Username"))
                            .otherwise(col("Body.NewValue.Username")))
                          .withColumn("FirstName", when(col("Properties.EventTypeName") === lit("user_principal_created"), col("Body.FirstName"))
                            .otherwise(col("Body.NewValue.FirstName")))
                          .withColumn("LastName", when(col("Properties.EventTypeName") === lit("user_principal_created"), col("Body.LastName"))
                            .otherwise(col("Body.NewValue.LastName")))
                          .withColumn("PrincipalId", when(col("Properties.EventTypeName") === lit("user_principal_created"), col("Body.PrincipalId"))
                            .otherwise(col("Body.NewValue.PrincipalId")))
                          .withColumn("TenantId", when(col("Properties.EventTypeName") === lit("user_principal_created"), col("Body.TenantId"))
                            .otherwise(col("Body.NewValue.TenantId")))
                          .withColumnRenamed("Timestamp", "LastChangeTimestamp")
                          // Creates the custom primary key.
                          .withColumn("PrincipalUserId", substring(concat(col("TenantId"), lit("-"), col("PrincipalId")), 0, 128))
                          // Selects the columns to keep.
                          .select("PrincipalUserId", "TenantId", "PrincipalId", "FirstName", "LastName", "Username", "LastChangeTimestamp")

It works only if the filter returns rows. If no row matches the filter, then the select clause fails:

org.apache.spark.sql.AnalysisException: No such struct field Username in…

Question

What can I do to handle such a scenario and prevent the select from failing?
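The AnalysisException comes from the schema rather than the number of rows: if the events are read with schema inference, a batch with no matching events can produce a schema that simply lacks Body.NewValue.Username, and referencing that path fails at analysis time even though no rows would be selected. Two common guards are to supply an explicit schema when reading the events, or to check whether the nested field exists in the schema and substitute a typed null otherwise. Below is a PySpark sketch of the second idea (Python, since the rest of this collection is Python; the same check can be written in Scala). The cast to "string" is an assumption about the field's type.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def has_field(schema, path):
    # True if a dotted path such as 'Body.NewValue.Username' exists in the schema.
    current = schema
    for part in path.split('.'):
        if not isinstance(current, StructType) or part not in current.names:
            return False
        current = current[part].dataType
    return True

def col_or_null(df, path, data_type='string'):
    # Fall back to a typed null when the struct field is missing from this batch's schema.
    return F.col(path) if has_field(df.schema, path) else F.lit(None).cast(data_type)

username = (
    F.when(F.col("Properties.EventTypeName") == "user_principal_created",
           col_or_null(events, "Body.Username"))
     .otherwise(col_or_null(events, "Body.NewValue.Username"))
)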

Repeat a row of dataframe n number of times in a list of dataframes

How can I repeat and bind a row of a dataframe n number of times in each dataframe of a list? So, for example with this list:

[[1]]
  x z y
  1 2 3

[[2]]
  x z y
  4 5 6

this is the desired output if n is 1:

[[1]]
  x z y
  1 2 3
  1 2 3

[[2]]
  x z y
  4 5 6
  4 5 6

Data:

list1 <- data.frame("x" = 1, "z" = 2, "y" = 3)
list2 <- data.frame("x" = 4, "z" = 5, "y" = 6)
list <- list(list1, list2)
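For reference, since the rest of this collection is in Python, here is a pandas analogue of the same operation (the question itself is about R data.frames, so this is only a sketch of the idea, not the R answer): concatenate each one-row frame with n extra copies of itself.

import pandas as pd

frames = [pd.DataFrame({'x': [1], 'z': [2], 'y': [3]}),
          pd.DataFrame({'x': [4], 'z': [5], 'y': [6]})]

n = 1  # number of extra repetitions per row
repeated = [pd.concat([df] * (n + 1), ignore_index=True) for df in frames]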

Write/read dataframe objects to memory or disk efficiently?

I'm running a for loop over all the rows of a pandas dataframe; for each row it calculates the Euclidean distance from that point to all the other points in the dataframe, then moves on to the next point and does the same thing, and so on.

The thing is that I need to store the value counts of the distances to plot a histogram later, and I'm storing them in another pandas dataframe. The problem is that as the second dataframe gets bigger, I will run out of memory at some point. Also, as the dataframe grows, each pass of the loop gets slower, since the dataframe becomes heavier and harder to handle in memory.

This is the original piece of code I was using:

counts = pd.DataFrame()

for index, row in df.iterrows():
    dist = pd.Series(np.sqrt((row.xx - df.xx)**2 + (row.yy - df.yy)**2 + (row.tt - df.tt)**2))
    counter = pd.Series(dist.value_counts(sort=True)).reset_index().rename(columns={'index': 'values', 0: 'counts'})
    counts = counts.append(counter)

The original df has a shape of (695556, 3), so the expected result would be a dataframe of shape up to (695556**2, 2), containing every pairwise distance value computed from the three coordinate columns together with its count. The problem is that this is impossible to fit into my 16 GB of RAM.

So I tried this instead:

for index, row in df.iterrows():
    counts = pd.DataFrame()
    dist = pd.Series(np.sqrt((row.xx - df.xx)**2 + (row.yy - df.yy)**2 + (row.tt - df.tt)**2))
    counter = pd.Series(dist.value_counts(sort=True)).reset_index().rename(columns={'index': 'values', 0: 'counts'})
    counts = counts.append(counter)
    counts.to_csv('counts/count_' + str(index) + '.csv')
    del counts

In this version, instead of keeping the counts dataframe in memory, I write a CSV on each iteration. The idea is to put everything together later, once the loop finishes. This code runs faster than the first one, since the time per iteration no longer increases as the dataframe grows. It is still slow, though, since it has to write a CSV on every pass, and it will be even slower when I have to read all of those CSVs back into a single dataframe.

Can anyone show me how I could optimize this code to achieve the same result in a faster and more memory-efficient way? I'm also open to other implementations, such as Spark or Dask, or any other approach that produces the same result (a dataframe containing the value counts for all the distances) but is more manageable in terms of time and memory.

Thank you very much in advance
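If the value counts are only needed to draw a histogram, one option (a sketch under that assumption; the bin edges and chunk size below are guesses to tune) is to never store the individual distances at all: process the points in blocks against the full array and accumulate a fixed-bin histogram with NumPy, which keeps memory flat no matter how many distances are computed.

import numpy as np
import pandas as pd

# Assumed: df has float columns xx, yy and tt, as in the question.
pts = df[['xx', 'yy', 'tt']].to_numpy()

# Fixed bin edges chosen up front; the sum of the per-axis ranges is an upper
# bound on any pairwise distance.  Adjust the number of bins as needed.
max_dist = np.ptp(pts, axis=0).sum()
bins = np.linspace(0.0, max_dist, 1001)
hist = np.zeros(len(bins) - 1, dtype=np.int64)

chunk = 100  # rows per block; each block needs a few (chunk x len(pts)) float arrays in memory
for start in range(0, len(pts), chunk):
    block = pts[start:start + chunk]
    # (chunk, n) distances from every point in the block to every point in df.
    dx = block[:, 0:1] - pts[None, :, 0]
    dy = block[:, 1:2] - pts[None, :, 1]
    dt = block[:, 2:3] - pts[None, :, 2]
    d = np.sqrt(dx**2 + dy**2 + dt**2)
    counts, _ = np.histogram(d, bins=bins)
    hist += counts

histogram = pd.DataFrame({'bin_left': bins[:-1], 'count': hist})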