Datasets and Time values: Filtering and Plotting

After struggling for a few weeks, I have accumulated a few examples of challenging date-related operations on datasets. For some I have partial solutions; others I cannot do yet. The resources here on date values in Datasets are thin (both are relatively new to Mathematica, so no surprise), so maybe this will help fill that gap.

For all of my data, we have events that occur at a particular date and time. We want to understand how progress on given days compares; for example, when did task A occur on April 4, 2017 versus October 12, 2018? What is the distribution of task A's occurrence throughout the day, over 100 days?

  1. How can I plot a DateHistogram with a bin of, say, 20 minutes? I can do it by hour: DateHistogram[dataset, "Hour", DateReduction->"Day"]

  2. How can I use the DateHistogram plot style in operator form? Relatively new functionality allows the syntax dataset[plotstyle, "Key"] (see this nice Stack Exchange thread). For example, I can write:

    dataset[GroupBy[Key["State"]] /* (PieChart[#, ChartLabels -> Keys[#]] &), Length]

    I want to use this syntax with DateHistogram; I can implement the most basic version:

    dataset[DateHistogram,"Sent"]

    I want to add the DateReduction option, along the lines of:

    dataset[DateHistogram["Hour",DateReduction->"Day"],"Sent"]

    Unfortunately, the above example doesn’t work and I can’t find more documentation.

  3. How can I select objects in a certain time window? I figured out a method, but perhaps there is a more elegant solution. In the example below, I plot the events that occur before 2016:

    eventsBefore2016 = dataset[Select[#Sent < DateObject[{2016, 1, 1}] &], "Sent"];
    DateHistogram[eventsBefore2016, "Hour", DateReduction -> "Day"]

    Can I make these two lines into a single line of code? Could I do it using the operator form (question 2)?

  4. How can I adapt a DistributionChart for a dataset of date-time values? I don't have example starter code for this one.

(Apologies for the generic code examples above; I can't share the actual data, and my workplace blocks the tutorial datasets such as titanic. If this is a real sticking point, I can adapt this question with a tutorial dataset later at home.)

Post filtering is returning a blank page

I embedded some code to generate post filters, following the article at https://premium.wpmudev.org/blog/add-post-filters/, as shown below:

<form class='post-filters'>
  <select name="orderby">
    <?php
      $orderby_options = array(
        'post_date'  => 'Order By Date',
        'post_title' => 'Order By Title',
        'rand'       => 'Random Order',
      );
      foreach ( $orderby_options as $value => $label ) {
        echo "<option " . selected( $_GET['orderby'], $value ) . " value='$value'>$label</option>";
      }
    ?>
  </select>
  <select name="order">
    <?php
      $order_options = array(
        'DESC' => 'Descending',
        'ASC'  => 'Ascending',
      );
      foreach ( $order_options as $value => $label ) {
        echo "<option " . selected( $_GET['order'], $value ) . " value='$value'>$label</option>";
      }
    ?>
  </select>
</form>

However, the page where this code is embedded returns only a blank page, with no code visible in the source view. I also tried the Search & Filter plugin, only to get the same blank page.

I cannot resolve this issue on my own. Could someone please help me correct the code?

React – Text input / filtering bottleneck

I've got a big React app (with Redux) that has a huge bottleneck.

We have implemented a product search by product number or product name, and this search is extremely laggy.

Problem: when a user types some characters, they show up in the input field only after a noticeable delay. The UI is frozen for a couple of seconds. In Internet Explorer 11, the search is almost unusable.

It’s a Material UI TextField that filters products.

What I already did for optimization:

  1. Replaced things like style={{ maxHeight: 230, overflowY: 'scroll', }} with const cssStyle = {..}
  2. Changed some critical components from React.Component to React.PureComponent
  3. Added shouldComponentUpdate for our SearchComponent
  4. Removed some unnecessary closure bindings
  5. Removed some unnecessary objects
  6. Removed all console.log()
  7. Added debouncing for the input field (which made it even worse)

This is what our SearchComponent looks like at the moment:

import React, { Component } from 'react';
import PropTypes from 'prop-types';
import Downshift from 'downshift';

import TextField from '@material-ui/core/TextField';
import MenuItem from '@material-ui/core/MenuItem';
import Paper from '@material-ui/core/Paper';
import IconTooltip from '../helper/icon-tooltip';

import { translate } from '../../utils/translations';

const propTypes = {
  values: PropTypes.arrayOf(PropTypes.shape({})).isRequired,
  legend: PropTypes.string,
  helpText: PropTypes.string,
  onFilter: PropTypes.func.isRequired,
  selected: PropTypes.oneOfType([PropTypes.string, PropTypes.number]),
  isItemAvailable: PropTypes.func,
};

const defaultProps = {
  legend: '',
  helpText: '',
  selected: '',
  isItemAvailable: () => true,
};

const mapNullToDefault = selected =>
  (selected === null || selected === undefined ? '' : selected);

const mapDefaultToNull = selected => (!selected.length ? null : selected);

class AutoSuggestField extends Component {
  shouldComponentUpdate(nextProps) {
    return this.props.selected !== nextProps.selected;
  }

  getLegendNode() {
    const { legend, helpText } = this.props;
    return (
      <legend>
        {legend}{' '}
        {helpText && helpText.length > 0 ? (
          <IconTooltip helpText={helpText} />
        ) : (
          ''
        )}
      </legend>
    );
  }

  handleEvent(event) {
    const { onFilter } = this.props;

    const value = mapDefaultToNull(event.target.value);

    onFilter(value);
  }

  handleOnSelect(itemId, item) {
    const { onFilter } = this.props;
    if (item) {
      onFilter(item.label);
    }
  }

  render() {
    const { values, selected, isItemAvailable } = this.props;

    const inputValue = mapNullToDefault(selected);
    const paperCSSStyle = {
      maxHeight: 230,
      overflowY: 'scroll',
    };
    return (
      <div>
        <div>{this.getLegendNode()}</div>
        <Downshift
          inputValue={inputValue}
          onSelect={(itemId) => {
            const item = values.find(i => i.id === itemId);
            this.handleOnSelect(itemId, item);
          }}
        >
          {/* See children-function on https://github.com/downshift-js/downshift#children-function */}
          {({
            isOpen,
            openMenu,
            highlightedIndex,
            getInputProps,
            getMenuProps,
            getItemProps,
            ref,
          }) => (
            <div>
              <TextField
                className="searchFormInputField"
                InputProps={{
                  inputRef: ref,
                  ...getInputProps({
                    onFocus: () => openMenu(),
                    onChange: (event) => {
                      this.handleEvent(event);
                    },
                  }),
                }}
                fullWidth
                value={inputValue}
                placeholder={translate('filter.autosuggest.default')}
              />
              <div {...getMenuProps()}>
                {isOpen && values && values.length ? (
                  <React.Fragment>
                    <Paper style={paperCSSStyle}>
                      {values.map((suggestion, index) => {
                        const isHighlighted = highlightedIndex === index;
                        const isSelected = false;
                        return (
                          <MenuItem
                            {...getItemProps({ item: suggestion.id })}
                            key={suggestion.id}
                            selected={isSelected}
                            title={suggestion.label}
                            component="div"
                            disabled={!isItemAvailable(suggestion)}
                            style={{
                              fontWeight: isHighlighted ? 800 : 400,
                            }}
                          >
                            {suggestion.label}
                          </MenuItem>
                        );
                      })}
                    </Paper>
                  </React.Fragment>
                ) : (
                  ''
                )}
              </div>
            </div>
          )}
        </Downshift>
      </div>
    );
  }
}

AutoSuggestField.propTypes = propTypes;
AutoSuggestField.defaultProps = defaultProps;

export default AutoSuggestField;
<script src="https://cdnjs.cloudflare.com/ajax/libs/react/16.5.0/umd/react.production.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/react-dom/16.5.0/umd/react-dom.production.min.js"></script>

It seems that I have not found the real performance problem yet, as it still exists. Can someone help here?

Filtering and creating new columns by condensing lists of per-item information

I am trying to improve my programming skills at work (I am an analyst), and one of the engineering projects I worked on is around ETL. Essentially, we roll up each individual's account information into a single row per user, and their purchases are converted to a list. There are two different pieces of logic that use different columns for the rollups, each performing the conversion in a slightly different way; this is done by splitting the data initially and then combining it again. I wrote PySpark code (for the first time) and was hoping the experts here could give me some feedback before I share it with my manager. It's a new job, so I really want to give my best.

Here is an example of what I was trying to accomplish, with my code alongside it. My manager is hoping for 'modular' and 'refactored' code. I have tried my best to follow PEP 8 style, but apart from that I am unsure how to improve this code.

from pyspark.sql import functions as F

# Load the trade files from the filestore
adfe_df = df1

# Drop the account column since each account has a different token every time we get new data (monthly data)
adfe_df = adfe_df.drop('Col1')

# Drop duplicate rows - rows are duplicated every time we get new data
adfe_df = adfe_df.dropDuplicates()

adfe_df = adfe_df.orderBy('Col2')
adfe_df2 = adfe_df.filter(F.col('Col3').isNotNull())
adfe_df3 = adfe_df.filter(F.col('Col3').isNull() & F.col('Col4').isNotNull())

# Grouping by account information and creating lists for all the variables we care about
adfe_df2 = adfe_df2\
    .groupBy('Col5')\
    .agg(
        F.collect_list('Col6').alias('Col6'),
        F.collect_list('Col7').alias('Col7'),
        F.collect_list('Col8').alias('Col8'),
        F.collect_list('Col9').alias('1'),
    )

# Creating a list of columns we want to split --> Payment_tracking_cycle_*
columns = adfe_df2.select('1').columns

# Collect the sizes of each column in consideration
adfe_df2_sizes = adfe_df2.select(*[F.size(col).alias(col) for col in columns])

# Get the max list length for each column
adfe_df2_max = adfe_df2_sizes.agg(*[F.max(col).alias(col) for col in columns])

# Pick the first row to get each max array size
max_dict = adfe_df2_max.collect()[0].asDict()

# Splitting the list columns into multiple columns, renamed to the "AB" + x + y format
adfe_df2_result = adfe_df2.select(
    'Col6',
    'Col7',
    'Col8',
    '1',
    *[adfe_df2[col][i].alias("AB" + str(i + 1) + col) for col in columns for i in range(max_dict[col])])

adfe_df2_result.cache()

# Converting the column values to bool (columns store actual values currently)
adfe_df3 = adfe_df3\
    .withColumn('Col10', F.when(adfe_df3['Col10'] > 0, 1)
                .otherwise(adfe_df3['Col10']))

# Grouping by account information and creating lists for all the variables we care about
adfe_df3 = adfe_df3.groupBy('Col5')\
    .agg(
        F.collect_list('Col6').alias('Col6'),
        F.collect_list('Col7').alias('Col7'),
        F.collect_list('Col8').alias('Col8'),
        F.collect_list('Col10').alias('2'),
    )

# Creating a list of columns we want to split --> Payment_tracking_cycle_*
columns = adfe_df3.select('2').columns

# Collect the sizes of each column in consideration
adfe_df3_sizes = adfe_df3.select(*[F.size(col).alias(col) for col in columns])

# Get the max list length for each column
adfe_df3_max = adfe_df3_sizes.agg(*[F.max(col).alias(col) for col in columns])

# Pick the first row to get each max array size
max_dict = adfe_df3_max.collect()[0].asDict()

# Splitting the list columns into multiple columns, renamed to the "AB" + x + y format
adfe_df3_result = adfe_df3.select(
    'Col6',
    'Col7',
    'Col8',
    '2',
    *[adfe_df3[col][i].alias("AB" + str(i + 1) + col) for col in columns for i in range(max_dict[col])])

adfe_df3_result.cache()

# Create a list of all the unique columns in the two tables
adfe_df2_result_cols = adfe_df2_result.columns
adfe_df3_result_cols = adfe_df3_result.columns

all_abxy_list = list(set(adfe_df2_result_cols) | set(adfe_df3_result_cols))

# If a column is missing from a dataframe, create a null column with that name, cast to 'int' for consistency
for column in all_abxy_list:
    if column not in adfe_df2_result_cols:
        adfe_df2_result = adfe_df2_result.withColumn(column, F.lit(None).cast('int'))
    if column not in adfe_df3_result_cols:
        adfe_df3_result = adfe_df3_result.withColumn(column, F.lit(None).cast('int'))

# Join the two tables for the final ABxy table for the accounts that have either piece of information
adfe_abxy = adfe_df2_result.union(adfe_df3_result)
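The two branches above differ only in their row filter, the optional flag conversion, and the column that gets collected and split, so one way toward the 'modular' code your manager asked for might be to factor the repeated group-collect-split steps into a helper. This is only a rough sketch under that assumption; rollup_and_split, the prefix argument, and the column choices are hypothetical placeholders rather than your real schema:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def rollup_and_split(df: DataFrame, group_col: str, keep_cols: list, list_col: str, prefix: str = "AB") -> DataFrame:
    """Group by group_col, collect keep_cols and list_col into lists,
    then split list_col into one column per element (prefix + position + name)."""
    rolled = df.groupBy(group_col).agg(
        *[F.collect_list(c).alias(c) for c in keep_cols],
        F.collect_list(list_col).alias(list_col),
    )

    # The longest collected list determines how many split columns are needed
    max_len = rolled.agg(F.max(F.size(list_col))).collect()[0][0] or 0

    split_cols = [
        rolled[list_col][i].alias(prefix + str(i + 1) + list_col)
        for i in range(max_len)
    ]
    return rolled.select(group_col, *keep_cols, list_col, *split_cols)

# Hypothetical usage with the pre-grouped dataframes from the split step above
kept = ['Col6', 'Col7', 'Col8']
branch_a = rollup_and_split(adfe_df2, 'Col5', kept, 'Col9')
branch_b = rollup_and_split(adfe_df3, 'Col5', kept, 'Col10')

# Align columns (filling missing ones with nulls) and combine by name
for col_name in set(branch_a.columns) | set(branch_b.columns):
    if col_name not in branch_a.columns:
        branch_a = branch_a.withColumn(col_name, F.lit(None).cast('int'))
    if col_name not in branch_b.columns:
        branch_b = branch_b.withColumn(col_name, F.lit(None).cast('int'))
combined = branch_a.unionByName(branch_b)

Using unionByName (Spark 2.3+) avoids relying on the column positions matching between the two branches after the null columns are added.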

tshark: filtering out same-VLAN traffic from the command line

I am looking for help on how to filter same-VLAN traffic (i.e. src IPs in 10.1.1.0/24 and dst IPs in 10.1.1.0/24) out of a capture file and output all other traffic to a txt file.

I am using a batch file to process all the captures in a directory with tshark, and I want to use a display filter to remove this "same VLAN" traffic and reduce the size of the output.

Here is the batch file I am using:

@echo off

set cDate=captureday
set cap_files="*.pcapng"
set cap_folder="D:\caps\%cDate%"
set outfile-udp=D:\conversations\cap-UDP-%cDate%.txt
set outfile-tcp=D:\conversations\cap-TCP-%cDate%.txt
set tshark_cmd="c:\Program Files\Wireshark\tshark"
set tshark_udp=-Y "!(ipv6)" -q -z conv,udp
set tshark_tcp=-Y "!(ip.src in {10.1.1.0/24} && ip.dst in {10.1.1.0/24})" -q -z conv,tcp

echo. > %outfile-udp%
echo. > %outfile-tcp%

for /r %cap_folder% %%f in (%cap_files%) do (
    echo Processing File: %%f
    REM echo == File:  %%f >> %outfile%
    %tshark_cmd%  -r %%f %tshark_udp% >> %outfile-udp%
    %tshark_cmd%  -r %%f %tshark_tcp% >> %outfile-tcp%
)

echo.
echo Results in: %outfile-udp%
echo Results in: %outfile-tcp%

This doesn't seem to work, however, so I was wondering if someone out there has done this before and can tell me what I am doing incorrectly. Basically, I am still getting this 10.1.1.0/24 <-> 10.1.1.0/24 traffic in the output.

This is for already-existing captures. For new captures I already have a capture filter running that ignores this traffic.

This is an example of the output of the text file:


================================================================================
TCP Conversations
Filter:<No Filter>
                                                |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                                | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
10.1.1.154:51087   <->   10.1.1.202:5482    20016   1417372   13477  20140238   33493  21557610     4.393919000       124.2292
10.1.1.154:51088   <->   10.1.1.201:5482    17479   1170181   11871  17779866   29350  18950047    10.602867000       118.3747
10.1.1.154:52765   <->   10.1.1.194:2021     7346    584098    4201    338242   11547    922340     0.187889000       128.7159
10.1.1.154:57613   <->   10.1.1.193:2021     7197    572273    4111    331079   11308    903352     0.182195000       128.7268
10.1.1.154:52120   <->   10.1.1.192:2021     7180    570921    4104    330495   11284    901416     0.136573000       128.7624
10.1.1.154:61285   <->   10.1.1.190:2021     7126    566585    4066    327231   11192    893816     0.182070000       128.7277
10.1.1.154:52738   <->   10.1.1.191:2021     6301    500987    3602    289742    9903    790729    18.821383000       110.0925

Alternatively, if there is a way to just delete the lines where the source and destination IPs are both in that subnet, using sed or findstr, I would be happy to do that as well 😉
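If post-processing the text output is an acceptable fallback, a small Python script can do what the sed/findstr idea describes. This is just a sketch assuming the conversation lines look like the sample shown above; the file names and the subnet are assumptions taken from the batch file, so adjust them as needed:

import ipaddress
import re

SUBNET = ipaddress.ip_network("10.1.1.0/24")
# Matches the "ip:port <-> ip:port" endpoints at the start of a conversation line
LINE_RE = re.compile(r"^\s*(\d+\.\d+\.\d+\.\d+):\d+\s+<->\s+(\d+\.\d+\.\d+\.\d+):\d+")

def keep_line(line: str) -> bool:
    """Keep headers/separators, and any conversation that is not purely intra-subnet."""
    match = LINE_RE.match(line)
    if not match:
        return True  # not a conversation row
    src = ipaddress.ip_address(match.group(1))
    dst = ipaddress.ip_address(match.group(2))
    return not (src in SUBNET and dst in SUBNET)

with open("cap-TCP-captureday.txt") as fin, \
        open("cap-TCP-captureday-filtered.txt", "w") as fout:
    fout.writelines(line for line in fin if keep_line(line))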

Many thanks in advance

Cheers

Spatial element count data structure (e.g. R-tree) with filtering

Background: the application is a web map application with a LeafletJS front end and a NodeJS back end with a Postgres database. The application is to be used both on desktops and on smartphones/tablets in the field. The map should visualise ~30K polygons, a number that grows by ~3K per year. The data set sees insertions and deletions less than once every 12 hours but is queried constantly. At all zooms, a representation of the entire spatial data set should be available: at lower zooms a cluster or heatmap representation is ideal, while at high zooms the individual polygons should be visible. The main issue is that the counts represented by the clustering must also be filterable by membership in a finite number of sets, each with a finite number of options (year, type of survey, etc.).

A naive implementation would be to calculate the centre/centroid/pole of inaccessibility of each polygon along with its set options and send these to a standard Leaflet clustering library on the client side to visualise when below a threshold zoom level, and to send the polygons themselves when above that zoom level. The client-controlled filter would iterate through each layer in the cluster or polygon set.

It seems to me that a better approach would be to build an R-tree server-side and, at each node level, include the total child count; on the client side, each cluster is then represented by this child count placed at the centre of its node's bounding box. Above the threshold zoom, polygons for that area are also stored in a client-side R-tree to avoid querying the database for areas that have been traversed more than once.

(A) Is this a sensible data structure and method of representing clusters?

(B) How can it be extended to compute the child count of subsets at different levels of zoom, such as calculating the count of every exclusive set at each level? (e.g. the count of elements with years x1 to x2 and survey types a, b, c but not d)
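As a rough illustration of (B), and only a sketch (plain Python with the actual R-tree splitting/packing logic omitted; the class and attribute names are hypothetical): each node can carry a small histogram keyed by the filterable attributes, aggregated bottom-up from its children, so the count for an arbitrary filter at any zoom level is a sum over matching keys rather than a walk over the leaves.

from collections import Counter

class CountNode:
    """A node of a count-annotated spatial tree (bounding-box splitting omitted)."""

    def __init__(self, bbox, children=None, items=None):
        self.bbox = bbox                 # (min_x, min_y, max_x, max_y)
        self.children = children or []   # child CountNodes (internal node)
        self.items = items or []         # leaf entries: dicts with 'year' and 'survey_type'
        self.counts = Counter()          # (year, survey_type) -> count
        self.rebuild_counts()

    def rebuild_counts(self):
        """Recompute the per-(year, survey_type) histogram from items and children."""
        self.counts.clear()
        for item in self.items:
            self.counts[(item['year'], item['survey_type'])] += 1
        for child in self.children:
            self.counts.update(child.counts)   # Counter.update adds counts

    def filtered_count(self, years=None, survey_types=None):
        """Count of descendants matching the (optional) year and survey-type filters."""
        return sum(
            n for (year, stype), n in self.counts.items()
            if (years is None or year in years)
            and (survey_types is None or stype in survey_types)
        )

A cluster marker at a given zoom would then display something like node.filtered_count(years=range(2015, 2019), survey_types={'a', 'b', 'c'}) for whichever node is chosen at that zoom level; because the number of (year, survey type) combinations is finite and small, the per-node histograms stay cheap to store and to merge on the infrequent inserts and deletes.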

Server side lookup field filtering

I am developing my own lookup field.

I want it to have a dynamically filtered collection of values, depending on the user who is creating the new item.

Which method of the lookup field class should I override to build my own collection of values to send to the client for rendering?

I also need to send additional info about the items in the related lookup list from the server to the client.

And all of this really needs to happen on the server side.

The solutions I have found are all about the client side… REST, CSOM, JSLink, CSR…

Any advice about the server side, please?

Implementing objects and retrieving lists, filtering, invoking methods: low-level system code [on hold]

I'm thinking about system-level code and am wondering how to implement objects whose fields can be accessed efficiently.

Say we have an object named task, with members such as tid and kill; we could get at the tid with task->tid. Now what if we had to deal with many task objects, for example wanting to kill all tasks with tid > 100?

How could this be implemented properly? A naive implementation might loop through every object and check its tid, but that alone isn't enough. The object can be changed or modified in any way necessary to make this robust and performant. As long as it is clear where any new methods live and how the result is accessed, it would help me understand how this might be approached.
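For what it's worth, here is a minimal, language-agnostic sketch of one common approach (written in Python rather than real system-level code; Task, TaskTable, and the field names are hypothetical): keep the objects in a table plus a secondary index sorted on the field you query, so "kill all tasks with tid > 100" becomes a range lookup on the index instead of a scan over every object.

import bisect

class Task:
    """Minimal stand-in for a kernel-style task object."""
    def __init__(self, tid):
        self.tid = tid
        self.alive = True

    def kill(self):
        self.alive = False


class TaskTable:
    """Tasks stored by tid, plus a sorted index on tid for range queries."""
    def __init__(self):
        self._tasks = {}          # tid -> Task
        self._sorted_tids = []    # tids kept in sorted order

    def add(self, task):
        self._tasks[task.tid] = task
        bisect.insort(self._sorted_tids, task.tid)

    def with_tid_greater_than(self, threshold):
        # Binary search for the first tid > threshold, then slice the tail
        start = bisect.bisect_right(self._sorted_tids, threshold)
        return [self._tasks[tid] for tid in self._sorted_tids[start:]]


# Usage: kill all tasks with tid > 100 without scanning every task
table = TaskTable()
for tid in (5, 42, 101, 250):
    table.add(Task(tid))

for task in table.with_tid_greater_than(100):
    task.kill()

The naive full scan stays correct; an index keyed on the queried field (a sorted structure, hash table, or tree, much like pid/tid lookup tables in real kernels) just bounds how many objects each query has to touch.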