Fast reading of a large table

I have a CSV file structured as below:

1,0,2.2,0,0,0,0,1.2,0
0,1,2,4,0,1,0.2,0.1,0
0,0,2,3,0,0,0,1.2,2.1
0,0,0,1,2,1,0,0.2,0.1
0,0,1,0,2.1,0.1,0,1.2
0,0,2,3,0,1.1,0.1,1.2
0,0.2,0,1.2,2,0,3.2,0
0,0,1.2,0,2.2,0,0,1.1

but with 10k columns and 10k rows. I want to read it in such a way that the result is a dictionary whose key is the index of the row and whose value is a float array filled with every value in that row. For now my code looks like this:

var lines = File.ReadAllLines(filePath).ToList();
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
    var values = line?.Split(',').Where(v => !string.IsNullOrEmpty(v))
        .Select(f => f.Replace('.', ','))
        .Select(float.Parse).ToArray();
    return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);

but it takes up to 30 seconds to finish, so it's quite slow, and I want to optimize it to be a bit faster.
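
One direction I'm considering (just a sketch, not measured yet): stream the lines with File.ReadLines instead of materializing everything up front, and parse with CultureInfo.InvariantCulture so the per-value Replace('.', ',') and the extra string it allocates go away. The class, method, and variable names here are only illustrative:

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

static class CsvReader
{
    // Sketch: read lines lazily and parse "2.2" directly with the invariant
    // culture, so Replace('.', ',') and its extra per-value string disappear.
    public static Dictionary<int, float[]> ReadRows(string filePath)
    {
        return File.ReadLines(filePath)
            .AsParallel().AsOrdered()
            .Select((line, index) => (index, values: line
                .Split(',')
                .Where(v => !string.IsNullOrEmpty(v))
                .Select(v => float.Parse(v, CultureInfo.InvariantCulture))
                .ToArray()))
            .ToDictionary(d => d.index, d => d.values);
    }
}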

What is the most semantic way of sending a large array of strings to an API

I'm building a simple app that consists of a client web application and an API behind it. I'm combining two third-party APIs: one needs to be called from my client application (for ease of authentication) and one needs to be called by the server.

The data I receive from my first API is a list of strings – it will be between 0 and 150 items long, each string is an unknown length. Each one of these items needs to be passed to my server – either as individual elements (preferred) or as a serialised string. My endpoint will always return a single object, regardless of how many items are passed to it.

What is the most semantic way of passing this data to my API?

It’s not really a GET request as the returned object is dynamic, and I’m concerned about the URL length limits (discussed here) given that I don’t know what these strings will look like.

POST also feels incorrect as I'm not going to store what is sent to me, though it would allow me to send the data in the body of the request and not worry about the size.
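
Just to make the body-based option concrete, here is a rough sketch of what such an endpoint could look like if the server happened to be ASP.NET Core; this is purely illustrative, and the route and request shape are made up:

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Hypothetical route: the 0-150 strings arrive as a JSON array inside the
// request body, and a single object comes back regardless of how many were sent.
app.MapPost("/evaluate", (EvaluateRequest request) =>
    Results.Ok(new { received = request.Items.Length }));

app.Run();

// Hypothetical request shape: { "items": ["first string", "second string"] }
record EvaluateRequest(string[] Items);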

Perhaps I shouldn’t worry so much about semantics and just do what works, but I’m interested in opinions on the best, or ‘proper’, way to design my API here.

MobX vs Redux? Which state management solution can I adopt for large scale application development?

I was searching Google for the best state management solution for developing a large-scale React application, which might include some heavy presentational components like graphs and grids. I came across many comparisons of Redux vs. MobX. One thing I cannot understand is the point about scalability: it is said that MobX is comparatively less scalable than Redux, but I am not sure from which point of view scalability is being measured between the two. It is also said that MobX is more efficient and performant than Redux.

Please suggest which one to use for my React application.

Finder-like file manager that displays arbitrarily large thumbnails and allows free movement

I'm looking for a file manager that will make it easier to match up photos. I have two sets of photos with different names, and I need to give matching photos similar names. I have been using PCManFM, but it has two problems:

  1. I can't make the thumbnails any larger. I need something like OS X, where I can set the dimensions of the thumbnails, for instance to 128×128 or 256×256, as my needs change.

  2. I need to be able to drag files around and place them arbitrarily on the desktop. To do this I need to disable any automatic sorting or positioning of icons (also similar to what OS X provides). That way I can drag similar files next to each other before they get renamed.

PCManFM does not provide either functionality. I have tried Thunar, but while it lets me "zoom in" to increase the thumbnail size, that isn't always enough. In addition, it doesn't let me disable the automatic sorting and resizing.

I have seen Pantheon, but I'm running Lubuntu and don't want to mess with trying to install it unless I know it will work.

Are there any other file managers that will do this and will run easily on Lubuntu 18.04?

Variant of the Strong Law of Large Numbers

Let $X_1, X_2, \ldots$ be an i.i.d. sequence of random variables with uniform distribution on $[0,1]$, with $X_n \colon \Omega \to \mathbf{R}$ for each $n$.

Question. Is it true that $$\mathrm{Pr}\left(\left\{\omega \in \Omega: \lim_{n\to \infty}\frac{\sum_{1\le i\neq j \le n}{\bf{1}}_{(-1/n,1/n)}\bigl(X_i(\omega)-X_j(\omega)\bigr)}{n}=2\right\}\right)=1\,?$$

Here ${\bf{1}}_A(z)$ is the characteristic function of $A$, that is, it is $1$ if $z \in A$ and $0$ otherwise.
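
For intuition, a first-moment computation suggests where the constant $2$ comes from: for $i\neq j$ and independent uniforms on $[0,1]$, $$\mathrm{Pr}\bigl(|X_i-X_j|<\tfrac{1}{n}\bigr)=\frac{2}{n}-\frac{1}{n^2},$$ so the expected value of the numerator over the $n(n-1)$ ordered pairs is $$n(n-1)\left(\frac{2}{n}-\frac{1}{n^2}\right)=2(n-1)-\frac{n-1}{n}\sim 2n,$$ and dividing by $n$ gives $2$ in expectation; the question is whether this also holds in the almost sure sense.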

Large time behavior of Girsanov type Geometric Brownian Motion with time-dependent drift and diffusion

Recall the Geometric Brownian Motion $X_t={\rm e}^{\mu W_t+\left(\sigma-\frac{\mu^2}{2}\right)t}$. If $\sigma<\frac{\mu^2}{2}$, then $X_t$ tends to $0$ almost surely. But if we consider the following case, $$X_t=\exp\left\{\int_0^t\mu(t')\,{\rm d} W_{t'}+\int_0^t\left(\sigma(t')-\frac{\mu^2(t')}{2}\right){\rm d}t'\right\},$$ and also assume that $\sigma(t)<\frac{\mu^2(t)}{2}$ for all $t>0$ ($\mu$ and $\sigma$ are assumed to be good enough), do we still have the almost sure decay property, i.e. does $X_t$ tend to $0$ almost surely? It looks right, but what would the proof look like? I'm not really sure how to approach it at the moment. Any help is appreciated. Many thanks!
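
A sketch of one possible starting point (assuming $\mu$ is adapted with $\int_0^t \mu^2(s)\,{\rm d}s<\infty$ for every $t$): decompose the exponent into a local martingale plus a drift, $$\log X_t = M_t - A_t,\qquad M_t=\int_0^t \mu(s)\,{\rm d}W_s,\quad A_t=\int_0^t\left(\frac{\mu^2(s)}{2}-\sigma(s)\right){\rm d}s>0 .$$ By the Dambis-Dubins-Schwarz theorem, $M_t=B_{\langle M\rangle_t}$ for some Brownian motion $B$ with $\langle M\rangle_t=\int_0^t\mu^2(s)\,{\rm d}s$, and if $\langle M\rangle_\infty=\infty$ then $M_t/\langle M\rangle_t\to 0$ almost surely. Hence $X_t\to 0$ a.s. whenever $A_t\to\infty$ and $A_t$ grows at least proportionally to $\langle M\rangle_t$ (for instance if $\frac{\mu^2}{2}-\sigma\ge c\,\mu^2$ for some $c>0$); whether the pointwise inequality $\sigma<\frac{\mu^2}{2}$ alone is enough is exactly what I am unsure about.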

Proof of Weak Law of Large Numbers


For a sequence of i.i.d. random variables $(X_n)$ with $\Bbb{E}X_1=m$, let $\overline{X}_n=(X_1+\cdots+X_n)/n$; then $\overline{X}_n\stackrel{P}{\rightarrow}m$.

Since the $(X_n)$ are i.i.d., each $X_n$ induces the same probability measure $\mu$ on $\Bbb{R}$. So $$\Bbb{P}\{|\overline{X}_n-m|>\epsilon\}=\int_{\{|\frac{x_1+\cdots+x_n}{n}-m|>\epsilon\}}\Bbb{1}\,(\mathrm{d}\mu)^n,$$ where $(\mathrm{d}\mu)^n$ is the $n$-fold product measure of $\mu$. I want to know if we can prove the weak law of large numbers from this representation. Any help will be appreciated.
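
What I can see so far: if one additionally assumes $\Bbb{E}X_1^2<\infty$, the representation does lead somewhere, because expanding the square under the product measure makes the cross terms vanish by independence, $$\int_{\Bbb{R}^n}\left(\frac{x_1+\cdots+x_n}{n}-m\right)^2(\mathrm{d}\mu)^n=\frac{1}{n^2}\sum_{i=1}^n\int_{\Bbb{R}}(x_i-m)^2\,\mathrm{d}\mu=\frac{\operatorname{Var}(X_1)}{n},$$ so Chebyshev's inequality gives $\Bbb{P}\{|\overline{X}_n-m|>\epsilon\}\le \operatorname{Var}(X_1)/(n\epsilon^2)\to 0$. Under the first-moment assumption alone, a truncation argument would be needed on top of this.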

Efficient way to optimise Scala code to read a large file that doesn't fit in memory


Problem statement below.

We have a large log file which stores user interactions with an application. The entries in the log file follow this schema: {userId, timestamp, actionType}, where actionType is one of two possible values: [open, close].

Constraints:

  1. The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.
  2. Code has to be able to run on a single machine.
  3. Should not use an out-of-the-box implementation of MapReduce or a 3rd-party database; don't assume we have Hadoop, Spark, or another distributed computing framework.
  4. There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.
  5. Timestamps will come in strictly ascending order.

For this problem, we need to implement a class/classes that computes the average time spent by each user between open and close. Keep in mind that there are missing entries for some users, so we will have to make a choice about how to handle these entries when making our calculations. Code should follow a consistent policy with regards to how we make that choice.

The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.

Sample log file (comma-separated, text file)

1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close

Approach

Below is the code I've written in Scala and Python, which does not seem to be efficient or up to the expectations of the given scenario. I'd like feedback from the community of developers in this forum on how we could better optimise this code for the given scenario.

Scala implementation

import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
  }

  val userMetrics = new UserMetrics()
  userMetrics.readInputFile(args(0), if (args.length == 1) 600000 else args(1).toInt)
}

case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {

  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr: String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr);
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine();
        processInput(line, timeOut)
      }

      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo: UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence > 0) userInfo.timeSpent / userInfo.occurence else 0
        println("{" + key + "," + timespent + "}")
      }

      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")

    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()

    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)

    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus

      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if (prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if (lsUserInfo.size() == 0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }
}

Python Implementation

import sys

def fileBlockStream(fp, number_of_blocks, block):
    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.
    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given
    assert 0 < number_of_blocks
    fp.seek(0,2) #seek to end of file to compute block size
    file_size = fp.tell()
    ini = file_size * block / number_of_blocks #compute start & end point of file block
    end = file_size * (1 + block) / number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini-1)
        fp.readline()
    while fp.tell() < end:
        yield fp.readline() #iterate over lines of the particular chunk or block

def computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut):
    countPos, totTmPos, openTmPos, closeTmPos, nextEventPos = 0, 1, 2, 3, 4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos] == 1 and curEvent == "close":
            #Check if already existing userID with expected Close event 0 - Open; 1 - Close
            #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos] == 0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos] == 1 and curEvent == "open":
            curTotalTime, closeTime = missingHandler(defaultTimeOut, avgTimeSpentDict[userKeyID][openTmPos], curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = closeTime
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos] == 0 and curEvent == "close":
            curTotalTime, openTime = missingHandler(defaultTimeOut, avgTimeSpentDict[userKeyID][closeTmPos], curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][openTmPos] = openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif curEvent == "open":
            #Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0, 0, curTimeStamp, 0, 1]
        elif curEvent == "close":
            #Initialize userid with missing handler function since there is no Open event for this User
            totaltime, OpenTime = missingHandler(defaultTimeOut, 0, curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1, totaltime, OpenTime, curTimeStamp, 0]

def missingHandler(defaultTimeOut, curTimeVal, lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut, curTimeVal
    else:
        return lastTimeVal - curTimeVal, curTimeVal

def computeAvg(avgTimeSpentDict, defaultTimeOut):
    resDict = {}
    for k, v in avgTimeSpentDict.iteritems():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1] / v[0]
    return resDict

if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)

    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print(computeAvg(avgTimeSpentDict, defaultTimeOut))
    avgTimeSpentDict.clear() #Nullify dictionary
    fileObj.close() #Close the file object

Both programs above give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.

Thanks in advance!