Transpose a column in a DataFrame into a binary matrix

Context

Let's say I have a pandas DataFrame like this:

>>> data.head()
                           values  atTime
date
2006-07-01 00:00:00+02:00   15.10    0000
2006-07-01 00:15:00+02:00   16.10    0015
2006-07-01 00:30:00+02:00   17.75    0030
2006-07-01 00:45:00+02:00   17.35    0045
2006-07-01 01:00:00+02:00   17.25    0100

atTime represents the hour and minute of the timestamp used as the index. I want to transpose the atTime column into a binary matrix (making it sparse is also an option), which will be used as a nominal feature in a machine learning approach.

The desired result should look like:

>>> data.head()
                           values  0000  0015  0030  0045  0100
date
2006-07-01 00:00:00+02:00   15.10     1     0     0     0     0
2006-07-01 00:15:00+02:00   16.10     0     1     0     0     0
2006-07-01 00:30:00+02:00   17.75     0     0     1     0     0
2006-07-01 00:45:00+02:00   17.35     0     0     0     1     0
2006-07-01 01:00:00+02:00   17.25     0     0     0     0     1

As might be anticipated, this matrix will be much larger.

My question

I can achieve the desired result with workarounds, using apply and using the timestamps to create the new columns beforehand.

However, is there a built-in option in pandas (or via NumPy, considering atTime as a NumPy array) to achieve the same without a workaround?
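For what it's worth, pandas does ship a built-in for exactly this kind of one-hot encoding: pd.get_dummies. A minimal sketch, assuming data is the frame shown above:

import pandas as pd

# pd.get_dummies one-hot encodes a column directly; sparse=True gives
# the sparse variant mentioned above.
dummies = pd.get_dummies(data['atTime'], sparse=True)
result = pd.concat([data[['values']], dummies], axis=1)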

How to apply a function on a DataFrame column using multiple rows and columns as input?

I have a sequence of events, and based on some variables (previous command, previous/current code, and previous/current status) I need to decide which command is related to each event.

I actually have code that works as expected, but it's rather slow:

def mark_commands(df):
    for i in range(1, len(df)):
        prev_command = df.loc[i-1, 'Command']
        prev_code, cur_code = df.loc[i-1, 'Code'], df.loc[i, 'Code']
        prev_status, cur_status = df.loc[i-1, 'Status'], df.loc[i, 'Status']

        if (prev_command == "end" and
                ((cur_code == 810 and cur_status in [10, 15]) or
                 (cur_code == 830 and cur_status == 15))):
            df.loc[i, 'Command'] = "ignore"
        elif ((cur_code == 800 and cur_status in [20, 25]) or
              (cur_code in [810, 830] and cur_status in [10, 15])):
            df.loc[i, 'Command'] = "end"
        elif ((prev_code != 800) and
              ((cur_code == 820 and cur_status == 25) or
               (cur_code == 820 and cur_status == 20 and
                prev_code in [810, 820] and prev_status == 20) or
               (cur_code == 830 and cur_status == 25 and
                prev_code == 820 and prev_status == 20))):
            df.loc[i, 'Command'] = "continue"
        else:
            df.loc[i, 'Command'] = "begin"

    return df
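These rules are genuinely sequential (the "ignore" branch depends on the label just assigned to the previous row), so they cannot be naively vectorized; but most of the cost here is the per-scalar df.loc indexing. A sketch of the same logic over plain Python lists, which typically cuts the runtime substantially:

def mark_commands_fast(df):
    # Same sequential rules as above, but reading plain lists and writing
    # the result back in one assignment instead of one df.loc per row.
    codes = df['Code'].tolist()
    statuses = df['Status'].tolist()
    commands = df['Command'].tolist()   # row 0 keeps its existing label

    for i in range(1, len(commands)):
        prev_command = commands[i-1]
        prev_code, cur_code = codes[i-1], codes[i]
        prev_status, cur_status = statuses[i-1], statuses[i]

        if (prev_command == "end" and
                ((cur_code == 810 and cur_status in (10, 15)) or
                 (cur_code == 830 and cur_status == 15))):
            commands[i] = "ignore"
        elif ((cur_code == 800 and cur_status in (20, 25)) or
              (cur_code in (810, 830) and cur_status in (10, 15))):
            commands[i] = "end"
        elif (prev_code != 800 and
              ((cur_code == 820 and cur_status == 25) or
               (cur_code == 820 and cur_status == 20 and
                prev_code in (810, 820) and prev_status == 20) or
               (cur_code == 830 and cur_status == 25 and
                prev_code == 820 and prev_status == 20))):
            commands[i] = "continue"
        else:
            commands[i] = "begin"

    df['Command'] = commands
    return df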

How to calculate business days between two dates in a Python dataframe?

Good day, I need to calculate the number of days elapsed between two dates, counting only business/working days (no Saturdays or Sundays). Currently I have the following:

# DIFFERENCE BETWEEN 'Fecha1' AND 'Fecha2'
df['DIAS_TRANSCURRIDOS'] = (df.Fecha1 - df.Fecha2) / pd.Timedelta('1 day')

df.head()

Thanks.
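A sketch of one way to do this with NumPy's business-day counter (Monday through Friday by default), assuming Fecha1 and Fecha2 are datetime columns and Fecha1 is the later date:

import numpy as np

# busday_count expects datetime64[D] values, hence the casts; it counts
# weekdays from the start date up to (but not including) the end date.
df['DIAS_TRANSCURRIDOS'] = np.busday_count(
    df['Fecha2'].values.astype('datetime64[D]'),
    df['Fecha1'].values.astype('datetime64[D]'),
)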

Pandas dataframe: creating a new column by comparing all other rows

I have the following example:

import time

import numpy as np
import pandas as pd

def function(value, df):
    return len(df[df['A'] < value])

df = pd.DataFrame(0, index=np.arange(30000), columns=['A'])
df['A'] = df.index.values

start = time.time()
df['B'] = pd.Series([len(df[df['A'] < value]) for value in df['A']])
end = time.time()
print("time:", end - start)

start = time.time()
df['B'] = df['A'].apply(function, df=df)
end = time.time()
print("time:", end - start)

start = time.time()
series = []
for index, row in df.iterrows():
    series.append(len(df[df['A'] < row['A']]))
df['B'] = series
end = time.time()
print("time:", end - start)

Output:

time: 19.54859232902527
time: 23.598857402801514
time: 26.441001415252686

This example creates a new column by counting, for each row, the number of rows whose value is lower than that row's value.

For this type of problem (creating a new column by comparing each row against every other row of the dataframe), I have tried apply, a list comprehension, and a classic loop, but I think they are all slow.

Is there a faster way?

PS: A specialized solution for this example is not what interests me most; I would prefer a general solution for this type of problem.

Another example could be: for a dataframe with a column of strings, create a new column counting, for each row, the number of strings in the dataframe that begin with that row's first letter.
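For the numeric example above, one sketch of a vectorized answer: with method='min', Series.rank returns 1 plus the number of strictly smaller values, so the count of smaller rows is the rank minus one. The string variant reduces to a group count via value_counts (the column name 'name' below is illustrative, not from the question):

# Numeric case: count of rows with a strictly smaller value in 'A'.
df['B'] = df['A'].rank(method='min').astype(int) - 1

# String case (hypothetical column 'name'): for each row, count how many
# strings in the column start with the same first letter.
first = df['name'].str[0]
df['C'] = first.map(first.value_counts())

The general pattern is the same in both cases: replace the row-by-row comparison with a single aggregate (rank, value_counts, a sort, or a groupby) computed once over the whole column.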

Add a new column to a dataframe with the percentage of every value of this dataframe

I have the dataframe below and I would like to add a new column with the percentage of every value of this dataframe. Something like:

name <- c("asdad", "dssdd")
number <- c(5, 5)
df <- data.frame(name, number)

for (i in 1:nrow(df)) {
  percentage <- df[i, 1] / sum(df$number)
}

new <- cbind(df, percentage)

but I get NAs instead of percentages.

Loop on dataframe takes a lot of time

The dataframe subset feature is being used in a for loop across the dataframe rows. The result seems accurate; however, the time taken to complete the loop on 2000-odd rows is more than 4 minutes. Any advice or guidance on the quality of the code?

Datasets:

DF1 input

customer_id  31-12-2019 00:00  31-12-2018 00:00  31-12-2017 00:00  31-12-2016 00:00  31-12-2015 00:00  31-12-2014 00:00  31-12-2013 00:00  31-12-2012 00:00  31-12-2011 00:00  31-12-2010 00:00
70464016
70453975
79983381
76615995
73543785
78226476
70117143
76448285
73980212
74540790

File input

upload_date  customer_id  date       rating  rating_agency
05-03-2019   70464016     31-Dec-18  3       INTERNAL
05-03-2019   70453975     31-Dec-18  4+      INTERNAL
05-03-2019   79983381     31-Dec-18  3       INTERNAL
05-03-2019   76615995     31-Dec-18  4       INTERNAL
05-03-2019   73543785     31-Dec-18  4       INTERNAL
05-03-2019   78226476     31-Dec-18  4       INTERNAL
05-03-2019   70117143     31-Dec-18  4-      INTERNAL
05-03-2019   76448285     31-Dec-18  4-      INTERNAL
05-03-2019   73980212     31-Dec-18  5       INTERNAL
05-03-2019   74540790     31-Dec-18  5       INTERNAL
05-03-2019   76241783     31-Dec-18  5       INTERNAL
05-03-2019   76323368     31-Dec-18  5+      INTERNAL
05-03-2019   70732832     31-Dec-18  5       INTERNAL
05-03-2019   70453263     31-Dec-18  4-      INTERNAL
05-03-2019   73807515     31-Dec-18  5       INTERNAL
05-03-2019   71584306     31-Dec-18  5+      INTERNAL
05-03-2019   71017190     31-Dec-18  5       INTERNAL
05-03-2019   79142410     31-Dec-18  5       INTERNAL
05-03-2019   70455229     31-Dec-18  5       INTERNAL

The code is as follows:

for j in df1.itertuples(index=True, name='Pandas'):
    for i in range(1, len(df1.columns)):
        # for j in range(len(df1)):
        flag = file[(file['customer_id'] == j.customer_id) &
                    (file['year'] == df1.columns[i].year)]
        flag = flag[flag['date'] == flag['date'].max()]

        if len(flag) != 0:
            df1.iat[j.Index, i] = flag.rating.iloc[0]
        else:
            pass
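A sketch of a merge-style rewrite that avoids the nested loop entirely, assuming (as the loop above already does) that file has a 'year' column and that df1's date columns are Timestamps:

# Keep only the latest rating per (customer_id, year), pivot it to wide
# form, then fill each df1 column with one vectorized assignment.
latest = (file.sort_values('date')
              .drop_duplicates(['customer_id', 'year'], keep='last'))
wide = latest.pivot(index='customer_id', columns='year', values='rating')

df1 = df1.set_index('customer_id')
aligned = wide.reindex(df1.index)          # one row per customer in df1
for col in df1.columns:
    if col.year in aligned.columns:
        df1[col] = aligned[col.year]       # whole column at once
df1 = df1.reset_index()

This does the subset/max work once per (customer, year) group instead of once per cell, which is where the four minutes were going.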

What is the correct syntax for iterating over records in a dataframe or pytable?

import pandas as pd
import tables as pytb

with pytb.open_file('debug_counts.h5', mode='r') as h5file:
    table = h5file.get_node('/tbl_main')
    print("number of rows in table =", table.nrows)

    i = 0
    j = 0
    for row in table:
        j += 1
        if row['symbol'] == b"foo":
            i += 1
    print("table all records count =", j)
    print("table foo records count =", i)

    df = pd.DataFrame.from_records(table.read_where('(symbol == b"foo")'))
    print("dataframe size =", df.size)

    i = 0
    for index, row in df.iterrows():
        i += 1
    print("dataframe records count =", i)

    i = 0
    for record in table.where('(symbol == b"foo")'):
        i += 1
    print("table.where records count =", i)
# no explicit h5file.close() is needed; the with-block closes the file

Output:

runfile('G:/$HDF5/debug_counts.py', wdir='G:/$HDF5')
number of rows in table = 2826254
table all records count = 2826254
table foo records count = 37920
dataframe size = 985920
dataframe records count = 37920
table.where records count = 37920

The larger numbers are all correct. The 37920 numbers are incorrect, or at least not what I want. How do I get the output I’m looking for (985920, not 37920), and where does the 37920 come from?
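For context on where the numbers come from: DataFrame.size counts cells (rows × columns), while iterrows() visits rows, and 985920 / 37920 = 26, so the frame presumably has 26 columns. A quick check:

print(df.shape)   # (37920, 26) under that assumption: rows, columns
print(len(df))    # 37920 -- the row count, which is what iterrows() yields
print(df.size)    # 985920 -- rows * columns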