Is it okay to compare Test BLEU scores between NMT models while using a slightly modified standard test set?

I am using tst2013.en, found at https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/, as my test set to get a Test BLEU score that I can compare against previous models. However, I have to filter out some sentences that are longer than 100 words, because otherwise I don't have the resources to run the model.
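For context, the filtering step itself is simple. A minimal sketch (the Vietnamese file name tst2013.vi is a guess based on the en-vi pair, and the output file names are made up for illustration):

    max_len = 100  # drop sentence pairs whose English side exceeds 100 words

    with open("tst2013.en") as f_en, open("tst2013.vi") as f_vi, \
         open("tst2013.filtered.en", "w") as out_en, \
         open("tst2013.filtered.vi", "w") as out_vi:
        for en, vi in zip(f_en, f_vi):
            if len(en.split()) <= max_len:
                out_en.write(en)
                out_vi.write(vi)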

But with a slightly modified test set, is it acceptable to compare my Test BLEU score against models that use the unmodified test set?

Compare two lists and produce a list of names not duplicated

I have one list of all available people and a second list of the people who have been assigned. I would like to auto-populate a third list with the people from the first list who have not been assigned. Basically, whichever names from column A are not used in column B should show up in column C.

+----+----------+----------+----------+
|    | A        | B        | C        |
+----+----------+----------+----------+
| 1  |   All    | Assigned |   Free   |
+----+----------+----------+----------+
| 2  | AJ       | AJ       | Dayna    |
+----+----------+----------+----------+
| 3  | Dayna    | Leah     | Kristina |
+----+----------+----------+----------+
| 4  | Kristina | Mag      | Mai      |
+----+----------+----------+----------+
| 5  | Leah     | Milla    | Sarah    |
+----+----------+----------+----------+
| 6  | Mag      | Mimi     |          |
+----+----------+----------+----------+
| 7  | Mai      | Oksana   |          |
+----+----------+----------+----------+
| 8  | Milla    | Richelle |          |
+----+----------+----------+----------+
| 9  | Mimi     |          |          |
+----+----------+----------+----------+
| 10 | Oksana   |          |          |
+----+----------+----------+----------+
| 11 | Richelle |          |          |
+----+----------+----------+----------+
| 12 | Sarah    |          |          |
+----+----------+----------+----------+
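For reference, one common approach is a classic array formula in C2, filled down (a sketch only; the ranges assume the exact layout above, and it must be confirmed with Ctrl+Shift+Enter):

    =IFERROR(INDEX($A$2:$A$12, MATCH(0, COUNTIF($B$2:$B$8, $A$2:$A$12) + COUNTIF($C$1:C1, $A$2:$A$12), 0)), "")

The first COUNTIF excludes names already present in column B; the second skips names already pulled into column C above the current row, so each free name appears only once.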

How to compare two fields and highlight only changes

I have the following scenario: when an item is edited on the list, an email is sent with information about the changes.

Current solution: we created an additional list called Archive that holds information only for comparison. We have a workflow that runs on edit and compares fields between the current version and the one on the Archive list. It generates an email with highlighted fields that changed (IF field 'not equal' field_archive). Then the workflow updates the Archive item, ready for future changes.

Required changes: currently, this comparison only states whether anything has changed. We need to highlight the exact changes. For example:

Field1:
before change: This is a sample string.
after change: This string has been changed.
Information in email: This <strong>string has been changed</strong>

Field2 (checkboxes):
before change: [x] checkbox1 [x] checkbox2 [ ] checkbox3

after change: [x] checkbox1 [ ] checkbox2 [x] checkbox3

Information in email: checkbox1, <strong>checkbox3</strong>

What are the options for achieving this with workflows? Is there any solution other than sending the fields to a web service?
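For reference, the highlighting itself is a word-level diff. A minimal sketch of that diff in Python using difflib (standalone, not tied to SharePoint; hooking it into a workflow is exactly the open question):

    import difflib

    def highlight_changes(before, after):
        # Return the 'after' text with changed words wrapped in <strong> tags.
        before_words = before.split()
        after_words = after.split()
        matcher = difflib.SequenceMatcher(None, before_words, after_words)
        parts = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                parts.extend(after_words[j1:j2])
            elif op in ('replace', 'insert'):
                parts.append('<strong>' + ' '.join(after_words[j1:j2]) + '</strong>')
            # 'delete' contributes nothing from the 'after' side
        return ' '.join(parts)

    print(highlight_changes("This is a sample string.",
                            "This string has been changed."))
    # -> This <strong>string has been changed.</strong>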

How to efficiently get combinations from arrays of tuples [int, string] and compare them?

Just wondering if there is a way to run the piece of code below more efficiently. I have just started to get acquainted with parallel programming and think it might be the answer, but I have no idea how to do it using imap or processes... any help?

This is what the function does: it takes two arrays of five tuples each, hand = [H1,H2,H3,H4,H5] and board = [B1,B2,B3,B4,B5].

What I need to do is check all arrays formed by 2 items from hand and 3 items from board, like combination = [Hn,Hm,Bi,Bj,Bk] (C(5,2) × C(5,3) = 10 × 10 = 100 combinations in total).

Then I need to compare each one of the combinations against a dictionary to get the combination rank, and then return the best array (best rank) and the rank itself:

    def check_hand(hand, board, dictionary_A, dictionary_B):
        best_hand = []
        first_int = True
        # choose 2 of the 5 hand cards...
        for h1 in range(0, 4):
            for h2 in range(h1 + 1, 5):
                # ...and 3 of the 5 board cards
                for b1 in range(0, 3):
                    for b2 in range(b1 + 1, 4):
                        for b3 in range(b2 + 1, 5):
                            hand_check = [hand[h1], hand[h2], board[b1], board[b2], board[b3]]
                            hand_check = sort(hand_check)  # custom sort for my array of objects
                            hand_ranks = "".join([str(hand_check[0].rank), str(hand_check[1].rank),
                                                  str(hand_check[2].rank), str(hand_check[3].rank),
                                                  str(hand_check[4].rank)])
                            # all five suits equal -> flush dictionary, otherwise the regular one
                            if (hand_check[0].suit == hand_check[1].suit and
                                    hand_check[1].suit == hand_check[2].suit and
                                    hand_check[2].suit == hand_check[3].suit and
                                    hand_check[3].suit == hand_check[4].suit):
                                control = [dictionary_A[hand_ranks][0], dictionary_A[hand_ranks][1]]
                            else:
                                control = [dictionary_B[hand_ranks][0], dictionary_B[hand_ranks][1]]
                            # keep the best (rank, tiebreaker) pair seen so far
                            if first_int:
                                best_hand = hand_check
                                rank = control
                                first_int = False
                            elif int(control[0]) > int(rank[0]):
                                rank = control
                                best_hand = hand_check
                            elif int(control[0]) == int(rank[0]):
                                if int(control[1]) > int(rank[1]):
                                    rank = control
                                    best_hand = hand_check
        return best_hand, rank[0]

I need to run this check for 2 million different hands, and iterate over 1,000 times for every hand (ideally I would run it at least 100,000 times per hand, for a more statistically accurate result). Any ideas on how to make it more efficient?

Examples:

For hand = [['2','s'],['5','h'],['7','h'],['8','c'],['T','s']] and board = [['3','s'],['3','h'],['9','s'],['T','c'],['6','s']]

It checks every combination (2s5h3s3h9s, 2s5h3sTc, ..., 7hTs3s9s6s, ..., 8cTs9sTc6s), compares each against my dictionaries, and returns best_hand = [['6','s'],['7','h'],['8','c'],['9','s'],['T','c']], rank[0] = 5 (from the dictionary), which is the best poker hand for this case.

For hand = [['8','s'],['9','h'],['9','c'],['A','c'],['A','s']] and board = [['7','s'],['6','h'],['T','s'],['T','c'],['A','h']], it will return best_hand = [['T','s'],['T','c'],['A','s'],['A','h'],['A','c']], rank[0] = 7.
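For reference, the same 100-candidate search can be written with itertools.combinations, which removes the index bookkeeping. This is a sketch only, assuming the same custom sort helper and the same dictionary layout as above:

    from itertools import combinations

    def check_hand_fast(hand, board, dictionary_A, dictionary_B):
        # Same search as check_hand: C(5,2) * C(5,3) = 10 * 10 = 100 candidates.
        best_hand = None
        best_key = None
        for two in combinations(hand, 2):
            for three in combinations(board, 3):
                hand_check = sort(list(two) + list(three))  # the custom sort from above
                hand_ranks = "".join(str(card.rank) for card in hand_check)
                if len({card.suit for card in hand_check}) == 1:
                    control = dictionary_A[hand_ranks]  # all suits equal: flush lookup
                else:
                    control = dictionary_B[hand_ranks]
                key = (int(control[0]), int(control[1]))  # (rank, tiebreaker)
                if best_key is None or key > best_key:
                    best_key = key
                    best_hand = hand_check
        return best_hand, best_key[0]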

Compare formulas: which is most efficient?

Is there a way to test the speed/efficiency of formulas in Excel?

This can be done in SQL Server by showing the actual execution plan; is there anything similar that Excel can do?

For example, the formulas below will all give me the same result:

    {=SUM(SUMIF(A:A,D1:D3,B:B))}
    =SUMPRODUCT(ISNUMBER(MATCH(A1:A25,D1:D3,0))*B1:B25)
    =SUMIF(A1:A25,D2,B1:B25)+SUMIF(A1:A25,D3,B1:B25)+SUMIF(A1:A25,D1,B1:B25)

If I were to use these formulas over thousands of rows, my worksheet would take longer and longer to calculate.

How can I determine which is the most efficient to use?

The problem doesn't relate just to these examples; is there a general way of finding out execution time/performance?

Compare Bitcoin prices at different exchanges in one chart

I'm having a hard time finding an online place where I can see the price of Bitcoin (or other coins) from different exchanges.

To be clear, I want a chart, and the ability to look into historical data.

I see that CoinMarketCap offers the ability to see BTC prices at different exchanges, but it doesn't construct a chart from all of that, nor do I see how to get historical data. The chart it offers is an average of the Bitcoin price across many exchanges.

I want to see these BTC prices on the same chart, separately. What I want is for each exchange's price to form its own line, with the selected exchanges superimposed on the same chart.

Fake example:

    BTC Binance yesterday was 5100
    BTC Bitfinex yesterday was 5115
    BTC Binance today is 5080
    BTC Bitfinex today is 5090

So this chart would have two separate lines: a BTC Binance line and a BTC Bitfinex line.

What can be said from this is that Binance is the better place to buy BTC, since over the last two days BTC was cheaper there by 10 or so. But does this hold true over longer periods of time? What about other exchanges?

I want to see how this difference in prices between exchanges varies over time.
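If no ready-made site turns up, a rough DIY sketch is possible with the ccxt and matplotlib Python libraries (the trading symbols below are assumptions; listings differ per exchange):

    import ccxt
    import matplotlib.pyplot as plt
    from datetime import datetime, timezone

    # symbol names are assumptions; check each exchange's actual markets
    symbols = {"binance": "BTC/USDT", "bitfinex": "BTC/USD"}

    for name, symbol in symbols.items():
        exchange = getattr(ccxt, name)()
        # each OHLCV row is [timestamp_ms, open, high, low, close, volume]
        candles = exchange.fetch_ohlcv(symbol, timeframe="1d", limit=90)
        dates = [datetime.fromtimestamp(row[0] / 1000, tz=timezone.utc) for row in candles]
        closes = [row[4] for row in candles]
        plt.plot(dates, closes, label=name)  # one line per exchange

    plt.legend()
    plt.title("BTC close price by exchange (last 90 days)")
    plt.show()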

Is such a service available anywhere, in the way I'm describing it here? Thanks.

Compare every combination of variables within the powerset

I have some survey data that represents individuals' responses to multiple survey questions. There are about 10,000 people in my actual dataset, and each person answered 35 questions. From these 35 survey questions, I create one composite score for each individual by taking the average of the values of all the questions that person answered. I am looking to see if I can find a subset of fewer than all 35 variables that produces a composite score that is highly correlated with the score individuals receive if I use all of the questions.

Essentially, I want to be able to identify which questions you should ask, if you can only ask X questions, and still end up with a composite score similar to the one you would get by asking all the questions.


I have created an example dataset, but with only 1,000 individuals and 10 questions. I have then written the code below to identify every subset of variables and compare the correlation of individuals' scores when using just that subset with their scores when using all of the variables.

While my method below works when there are 1,000 individuals and 10 questions, it does not scale to my reality of 10,000 individuals and 35 questions, because of the number of possible combinations of variables (2^35 = 34,359,738,368).
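For a sense of scale: restricting the search to subsets of one fixed size k (the "if you can only ask X questions" case) shrinks the space from 2^35 to C(35, k). A quick illustration:

    from math import comb  # Python 3.8+

    print(2 ** 35)       # 34359738368 subsets of any size
    print(comb(35, 5))   # 324632 subsets of exactly 5 questions
    print(comb(35, 10))  # 183579396 subsets of exactly 10 questions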

(Note: I have simplified my scenario and the sample data for the purposes of this question.)


Code on GitHub: https://github.com/CurtLH/variable_subset/blob/master/powerset%20correlations.ipynb


    import numpy as np
    import pandas as pd
    from itertools import chain, combinations

    # initial parameters
    np.random.seed(1234)  # seed numpy's RNG, since np.random generates the values below
    num_ids = 1000
    num_vars = 10

    # create repeating IDs
    ids = sorted(list(range(0, num_ids)) * num_vars)

    # create repeating variables
    variables = list(range(0, num_vars)) * num_ids

    # create random integers
    values = np.random.randint(1, 5, size=num_ids * num_vars)

    # create a dataframe with these values
    df = pd.DataFrame({"id": ids, "variable": variables, "values": values})

    # sort the dataframe
    df.sort_values(['id', 'variable'], inplace=True)


    def powerset(iterable):
        """
        Thanks to https://stackoverflow.com/questions/1482308
        """
        s = list(iterable)
        return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))


    # create every subset of variables
    subsets = list(powerset(set(variables)))

    # calculate the average value by ID when including all variables
    actual = df.groupby('id')['values'].mean()


    def calculate(df, subsets):
        """
        1. Iterate over each subset of variables
        2. Subset the dataframe to only include those variables
        3. Group by ID and recalculate the mean value per ID
        4. Measure correlation with the complete set of variables
        """
        # create a dictionary to hold the results
        results = {}

        # iterate over each subset
        for s in subsets:

            # make sure there is at least 1 variable
            if len(s) > 0:

                # filter the dataframe to only the variables in the subset
                sub = df[df['variable'].isin(s)]

                # group by ID and calculate the average value
                scores = sub.groupby('id')['values'].mean()

                # calculate correlation against the complete set of variables
                corr = actual.corr(scores)

                # add results to the dictionary
                results[s] = {'num_items': len(s),
                              'correlation': corr}

        return results


    # time how long it takes to run
    %timeit results = calculate(df, subsets)

Timing results:

  • 2.23 s ± 3.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)