## Difference between SVM coefficients for a multiclass dataset and the equivalent OvR datasets

I trained a one-vs-all SVM classifier on the iris dataset (3 classes) and printed `coef_` and `intercept_`, as below:

```
[[-0.04625854  0.5211828  -1.00304462 -0.46412978]   // 1-vs-all coefficients
 [-0.00722313  0.17894121 -0.53836459 -0.29239263]   // 2-vs-all coefficients
 [ 1.15034043  1.14954525 -3.53985244 -4.24622393]]  // 3-vs-all coefficients
[  1.4528445    1.50771313  13.63764975]             // intercepts
```

Then I created one-vs-rest (OvR) versions of the iris dataset: in each one I labeled a specific class as 1 and the other two classes as 0. I trained the same classifier on the whole dataset again and printed `coef_` and `intercept_`. Here are the results:

```
1-vs-all: [[-0.04575352  0.52216766 -1.00294058 -0.46406882]] [ 1.44746413]
2-vs-all: [[-0.03070975 -2.38286314  1.13998914 -2.61285489]] [ 5.48399354]
3-vs-all: [[-1.15034043 -1.14954525  3.53985244  4.24622393]] [-13.63764975]
```

As you can see, the absolute values of the 1-vs-all and 3-vs-all results are essentially the same in both experiments, but the 2-vs-all results are completely different. Why is this happening?
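To make the comparison concrete, here is a small sketch that checks the two printouts row by row. The numbers are copied from the output above; the helper name `relation` and the tolerance of `1e-2` are assumptions chosen for illustration only.

```python
import math

# Coefficients copied from the two printouts in the question.
multiclass = [
    [-0.04625854, 0.5211828, -1.00304462, -0.46412978],
    [-0.00722313, 0.17894121, -0.53836459, -0.29239263],
    [1.15034043, 1.14954525, -3.53985244, -4.24622393],
]
binary = [
    [-0.04575352, 0.52216766, -1.00294058, -0.46406882],
    [-0.03070975, -2.38286314, 1.13998914, -2.61285489],
    [-1.15034043, -1.14954525, 3.53985244, 4.24622393],
]


def relation(a, b, tol=1e-2):
    """Say how vector b relates to vector a, up to an absolute tolerance."""
    if all(math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)):
        return "same"
    if all(math.isclose(x, -y, abs_tol=tol) for x, y in zip(a, b)):
        return "sign-flipped"
    return "different"


relations = [relation(m, b) for m, b in zip(multiclass, binary)]
for k, r in enumerate(relations, start=1):
    print(f"{k}-vs-all: {r}")
```

With these numbers the first row matches, the third row is the exact negation, and only the second row differs beyond the tolerance.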

## How to implement KNN and evaluate it on the Iris dataset?

I have the following Iris dataset in Python and I need to write KNN (K-Nearest Neighbors) so that it evaluates the Iris dataset…

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

print(X.shape)
print(y.shape)

fig = plt.figure(figsize=(20, 8))

# Left panel: features 1, 2 and 3
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.set_title("Iris dataset: dimensions 1, 2 and 3")
ax1.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.Paired)
ax1.set_xlabel('Sepal length')
ax1.set_ylabel('Sepal width')
ax1.set_zlabel('Petal length')

# Right panel: features 1, 2 and 4
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.set_title("Iris dataset: dimensions 1, 2 and 4")
ax2.scatter(X[:, 0], X[:, 1], X[:, 3], c=y, cmap=plt.cm.Paired)
ax2.set_xlabel('Sepal length')
ax2.set_ylabel('Sepal width')
ax2.set_zlabel('Petal width')

plt.show()
```

I need a function `def knn_predict(X):` that contains the KNN implementation.
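A minimal sketch of what such a function could look like, written in plain Python. The signature is an assumption: it takes the training data and `k` as extra arguments, which the `knn_predict(X)` stub in the question does not show. The toy points below are made up for illustration and are not the Iris data.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, X, k=3):
    """Predict a label for each row of X by majority vote among its k nearest training points."""
    predictions = []
    for query in X:
        # Euclidean distance from the query to every training point.
        distances = [
            (math.dist(query, point), label)
            for point, label in zip(X_train, y_train)
        ]
        # Labels of the k closest training points.
        nearest = [label for _, label in sorted(distances, key=lambda pair: pair[0])[:k]]
        predictions.append(Counter(nearest).most_common(1)[0][0])
    return predictions


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): two well-separated clusters.
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [(1.1, 1.0), (5.1, 5.0)]
y_test = [0, 1]

predicted = knn_predict(X_train, y_train, X_test, k=3)
print(predicted, accuracy(y_test, predicted))
```

To evaluate on Iris, one option is to hold out part of `X` and `y` as the test set and pass the rest as `X_train` and `y_train`.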

## How to create a macro that generates a JSON file from a dataset

I’m having problems generating a JSON file from a dataset in SAS Guide. I generated a TEST.JSON:

```
{"TP_SMS":"1"  "NM_REMETENTESMS":"00000159"},
{"TP_SMS":"2"  "NM_REMETENTESMS":"00000159"},
{"TP_SMS":"3"  "NM_REMETENTESMS":"00000159"},
{"TP_SMS":"4"  "NM_REMETENTESMS":"00000159"},
{"TP_SMS":"5"  "NM_REMETENTESMS":"00000159"},
. . .
{"TP_SMS":"9"  "NM_REMETENTESMS":"00000159"},
```

The TP_SMS field is filled in correctly, but the second field is wrong: every record holds only the value from the last row of my table.

Below is my code with the macro:

```sas
data teste30;
  set MATABLES.EXIT_DATA;
  RESP=cat(CD_CLIENTE,"|",ANWER_DATA);
  ID=_N_;
  call symputx('ID',ID);
  call symputx('CD_CLIENTE',CD_CLIENTE);
  call symputx('NM_PRIMNOMECLIENTE',NM_PRIMNOMECLIENTE);
  call symputx('RESP',RESP);
  call symputx('msgtext',msgtext);
run;

%macro MontaJSON(ID);
  WRITE OPEN OBJECT;
  WRITE VALUES "TP_SMS" "&ID";
  WRITE VALUES "NM_REMETENTESMS" "&CD_CLIENTE";
  WRITE CLOSE;
%mend MontaJSON(ID);

%macro SMSRecords;
  %do i = 1 %to &dim_IDs;
    %MontaJSON(&&&ID_&i);
  %end;
%mend SMSRecords;

proc sql;
  select id, CD_CLIENTE into :ID_1 - :ID_&SysMaxLong from work.teste30;
  %let dim_IDs = &sqlObs;
quit;

proc json out="C:\TEMP\TEST.json" pretty nokeys nosastags;
  write open array; /* container for all the data */
  %SMSRecords;
  write close; /* container for all the data */
run;
```

I expect the macro to output every value in sequence, as it does for the TP_SMS code:

```
{"TP_SMS":"1"  "NM_REMETENTESMS":"00014578"},
{"TP_SMS":"2"  "NM_REMETENTESMS":"21323445"},
{"TP_SMS":"3"  "NM_REMETENTESMS":"23456753"},
{"TP_SMS":"4"  "NM_REMETENTESMS":"00457663"},
{"TP_SMS":"5"  "NM_REMETENTESMS":"00014795"},
{"TP_SMS":"6"  "NM_REMETENTESMS":"00014566"},
{"TP_SMS":"7"  "NM_REMETENTESMS":"00014578"},
{"TP_SMS":"8"  "NM_REMETENTESMS":"00000122"},
{"TP_SMS":"9"  "NM_REMETENTESMS":"00000159"}
```

Does anyone have an idea how to solve this?

Thanks

## How to check for ID uniqueness on large dataset that can’t fit into memory or on a single disk

Say I have 500MB of memory to work with, but 100 terabytes of IDs I want to generate. I want these IDs (GUIDs) to be randomly selected and applied to records, so they shouldn’t appear in order when selected. Also, I have 1 billion petabytes (in this example) of possible IDs, which is way too many to actually generate. So I only select 100 terabytes initially out of the possible space. I’m not sure how many IDs that would be, but say it’s on the order of 10^32 or something large (I don’t know the exact number). So basically, this is the situation:

1. A gigantic space of possible IDs (10^32, or more precisely I am focusing on the number being 1 billion petabytes, an unreasonably large amount of data).
2. A subset of these IDs which are randomly selected. In this case, 100 terabytes of IDs.
3. A computer that is limited to 500MB or so, so all the possible IDs needed can’t possibly fit into memory.
4. The computer only has 10GB of disk space.

The question is how to architect a fast system for generating these IDs.

After some thought, my approach was to build a trie. Say the IDs are 32 characters matching `/[0-9]/`. Then we would build the equivalent of 100 terabytes in the trie, one character at a time. Sharing common prefixes avoids storing duplicate characters, which probably saves a few orders of magnitude of memory. But it would still require about 100 terabytes of memory to construct the trie, so that doesn’t work.
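For a sense of scale, the counts implied by the 32-digit example can be worked out directly. This sketch assumes one byte per character with no separators and decimal terabytes (10^12 bytes); both are assumptions, not figures from the question.

```python
ID_LENGTH = 32          # characters per ID, from the 32-digit example
ALPHABET_SIZE = 10      # digits 0-9
BYTES_PER_ID = ID_LENGTH  # assumption: one byte per character, no separators
TERABYTE = 10**12         # assumption: decimal terabytes

# Size of the full ID space and of the 100 TB subset.
possible_ids = ALPHABET_SIZE ** ID_LENGTH
selected_ids = 100 * TERABYTE // BYTES_PER_ID
fraction_used = selected_ids / possible_ids

print(f"possible IDs: {possible_ids:.3e}")
print(f"IDs in 100 TB: {selected_ids:.3e}")
print(f"fraction of the space used: {fraction_used:.3e}")
```

Under these assumptions, 100 terabytes holds about 3.1 × 10^12 IDs out of 10^32 possible ones.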

So basically in the extreme case where we can’t possibly store 100TB on the computer or even on a few external hard drives, we need the “cloud” to solve this. That is, we need to somehow use the cloud to check if the IDs have already been used, and then to generate one if it’s not been created or used.

The question is how to optimally do this so it takes the least amount of time to generate all the IDs.

Not looking for answers like “don’t worry about duplicates”, or “that’s too many IDs to generate”. I would like to know how to specifically solve this problem.

What I have resorted to is basically:

1. Check in “cloud” database if record exists.
2. If so, try a different value and repeat.
3. If not, then save it to the database.
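The three steps above can be sketched as follows. `InMemoryStore` is a hypothetical stand-in for the cloud database (a real one would sit behind a network call), and `random_id` assumes the 32-digit ID format from the trie example.

```python
import secrets


class InMemoryStore:
    """Hypothetical stand-in for the remote database of used IDs."""

    def __init__(self):
        self._seen = set()

    def exists(self, id_):
        return id_ in self._seen

    def save(self, id_):
        self._seen.add(id_)


def random_id(length=32):
    """Random string of decimal digits."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def generate_unique_id(store, make_candidate=random_id):
    """Retry until a candidate is not in the store, then record and return it."""
    while True:
        candidate = make_candidate()
        if store.exists(candidate):  # steps 1 and 2: already used, try again
            continue
        store.save(candidate)        # step 3: record the new ID
        return candidate


store = InMemoryStore()
first = generate_unique_id(store)
second = generate_unique_id(store)
print(first, second)
```

Each generated ID costs at least one lookup and one write against the store, which is where the time goes once the store is remote.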

But this is extremely slow and would take weeks of computer time to run and generate the IDs. It seems like there could be a data structure and algorithm to make this faster.

## Public Availability of a good Dataset in PCAP (TCPDUMP) format for IDS/IPS testing

I am trying to pass reputable malicious traffic through an IPS. There are several sources on the internet for exploring datasets, such as the DARPA set, which I think is the oldest (not available in pcap format and not that useful for modern-day use), or the NSL-KDD dataset. Here is a good link I found about options that I can look into. However, none of them has a dataset available in pcap format. Is there any reputable dataset available in PCAP or TCPDUMP format, or convertible to PCAP?

Thank you.

## Convert a huge txt-file into a dataset

My friend has this huge txt-log of sea levels. He wants to organize it into a dataset.

After importing the file, I used StringSplit to separate it into rows, and then into individual elements:

```mathematica
rawData = Import["rawData.txt"];
splitRawData = StringSplit[rawData, "%%"];
dataIwant = splitRawData[[19]];
FullForm[dataIwant];
splitDataIntoRows = StringSplit[dataIwant, "\n"];
splitData1 = StringSplit[splitDataIntoRows, " "];
```

I want to use this function to split the data into 6 columns.

```mathematica
convertListToAssociation =
  list \[Function]
    AssociationThread[{"Time (kyr BP)", "Sea level (m)", "T_NH(deg C)", "T_dw (deg C)", "delta_w", "delta_T"}, list]
```

What further steps should I take?

## Manipulate using data from Dataset

```mathematica
data = Dataset[{
  <|"userId" -> 5311, "Rec" -> "You need to improve in Free Response Question type"|>,
  <|"userId" -> 5312, "Rec" -> "You need to improve in Write Code Question type"|>,
  <|"userId" -> 5313, "Rec" -> "You have a great performance in all question types"|>,
  <|"userId" -> 5314, "Rec" -> "You need to improve in Multiple Choice Question type"|>}]
```

Is there a way to create a Manipulate in which the control shows the userId and, as I move the control, I see the text corresponding to each userId? Something similar to this:

```mathematica
Manipulate[
  Grid[{{userId}, {"text"}}, BaseStyle -> {FontFamily -> "Roboto"}],
  {{userId, 1}, 1, 5}]
```

I’m using the code below to create a dataset in Power BI. When I try to add a row to the dataset, I get a 404 Not Found error.

```powershell
$col1 = New-PowerBIColumn -Name UID -DataType String
$col2 = New-PowerBIColumn -Name Name -DataType String

$tables = New-PowerBITable -Name SampleTables -Columns $col1,$col2

$dataset = New-PowerBIDataSet -Name SampleReports -Tables $tables

Add-PowerBIDataSet -DataSet $dataset -WorkspaceId <>

# The dataset has been reflected in the respective workspace.

Add-PowerBIRow -DatasetId <> -TableName tables -Rows $Info -WorkspaceId <>
```

The last command returns:

```
Add-PowerBIRow : Operation returned an invalid status code ‘NotFound’
At line:1 char:1
+ Add-PowerBIRow -Dataset $dataset -TableName SampleTables -Rows $ …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : WriteError: (Microsoft.Power…a.AddPowerBIRow:AddPowerBIRow) [Add-PowerBIRow], HttpOperationException
    + FullyQualifiedErrorId : Operation returned an invalid status code ‘NotFound’,Microsoft.PowerBI.Commands.Data.AddPowerBIRow
```

## Assumption of a generation of the dataset by a probability distribution

Consider the following paragraph from the Deep Learning book:

The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the training set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption enables us to describe the data-generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data-generating distribution, denoted $p_{data}$. This probabilistic framework and the i.i.d. assumptions enables us to mathematically study the relationship between training error and test error.

The bolded part is difficult for me to understand. I have the following questions about how to interpret it:

1) How does a probability distribution generate a dataset?

2) Are the data-generating process and the probability distribution the same thing?

3) What is the sample space and random experiment for the underlying probability distribution?
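One way to picture the quoted passage is as a sampling procedure. The sketch below is a toy illustration only: the distribution inside `sample_example` is made up, and the names `sample_example` and `sample_dataset` are hypothetical. Each call to `sample_example` plays the role of one random experiment, and a dataset is just that experiment repeated independently.

```python
import random


def sample_example(rng):
    """Draw one (x, y) pair from a made-up data-generating distribution."""
    x = rng.gauss(0.0, 1.0)        # feature drawn from a standard normal
    noise = rng.gauss(0.0, 0.5)    # label noise
    y = 1 if x + noise > 0 else 0  # label depends on the feature plus noise
    return x, y


def sample_dataset(n, rng):
    """Repeat the single-example draw n times, independently, from the same distribution."""
    return [sample_example(rng) for _ in range(n)]


rng = random.Random(0)
train_set = sample_dataset(100, rng)  # training examples
test_set = sample_dataset(25, rng)    # test examples from the same distribution

print(len(train_set), len(test_set))
```

In this picture the training set and the test set differ only in which draws they contain; both come from the same single-example distribution, which is what the passage calls the data-generating distribution.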

## Search for names in my dataset from user input, like a phone directory

Hi, I have a problem with my Python code.

My code is:

```python
name=df["Nom latin"]

X=input(list(filter(lambda x: ' ' in x,name)))

print(X)
```

I need to filter the "Nom latin" column of my dataset. I want to type a few letters, such as afg, and get back every name that contains afg.
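A minimal sketch of the substring search described above, assuming every entry is a string and that matching should ignore case (the question does not say either way). The list `latin_names` is a hypothetical stand-in for `df["Nom latin"]`.

```python
def filter_names(names, query):
    """Return every name that contains query, ignoring upper/lower case."""
    needle = query.casefold()
    return [name for name in names if needle in name.casefold()]


# Hypothetical sample values standing in for df["Nom latin"].
latin_names = ["Afgekia sericea", "Rosa canina", "Quercus robur"]

matches = filter_names(latin_names, "afg")
print(matches)
```

In an interactive session the query could come from the keyboard instead, for example `filter_names(df["Nom latin"], input("Search: "))`.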