caar package

Submodules

caar.cleanthermostat module

caar.cleanthermostat.dict_from_file(raw_file, cycle=None, states=None, sensors_file=None, postal_file=None, auto=None, id_col_heading=None, cycle_col_heading=None, encoding='UTF-8', delimiter=None, quote=None, cols_to_ignore=None, meta=False)[source]

Read a delimited text file and create a dict of dicts. One inner dict has the key ‘cols_meta’ and contains column metadata. The other has the key ‘records’. The keys of the ‘records’ dict are named tuples containing numeric IDs and time stamps (and, for cycling data, the cycle mode if one is chosen with the ‘cycle=’ argument). The values are either single values (floats, ints or strings) or tuples of these types.

See the example .csv data files at https://github.com/nickpowersys/caar.

Example sensor cycle file column headings: DeviceId, CycleType, StartTime, EndTime.

Example sensor file column headings: SensorId, TimeStamp, Degrees.

Example outside temperature file column headings: LocationId, TimeStamp, Degrees.

Common delimiters (commas, tabs, pipes and spaces) are detected in that order within the data rows (the header has its own automatic delimiter detection and is handled separately), and the first delimiter detected is used. In all cases, a row is only used if its number of values matches the number of column headings in the first row.

Each input file is expected to have (at least) columns representing IDs, time stamps (or starting and ending time stamps for cycles), and (if not cycles) corresponding observations.

To use the automatic column detection functionality, use the keyword argument ‘auto’ and assign it one of the values: ‘cycles’, ‘sensors’, or ‘geospatial’.

The IDs should contain both letters and digits in some combination (leading zeroes are also allowed in place of letters). A column whose heading contains the string ‘id’, ‘Id’ or ‘ID’ will then be used as the ID index within the combined ID-time stamp index for a given input file. If there is no such heading, the leftmost column with alphanumeric strings (for example, ‘T12’ or ‘0123’) will be taken as the ID.

The output can be filtered on records from a state or set of states by specifying a comma-delimited string containing state abbreviations. Otherwise, all available records will be in the output.

If a state or states are specified, a sensors metadata file and postal code file must be specified in the arguments and have the same location ID columns and ZipCode/PostalCode column headings in the same left-to-right order as in the examples. For the other columns, dummy values may be used if there is no actual data.

Parameters:
  • raw_file (str) – The input file.
  • cycle (Optional[str]) – The type of cycling operation that will be included in the output. For example, possible values that may be in the data file are ‘Cool’ or ‘Heat’. If no specific value is specified as an argument, all operating modes will be included.
  • states (Optional[str]) – One or more comma-separated, two-letter state abbreviations.
  • sensors_file (Optional[str]) – Path of metadata file for sensors. Required if there is a states argument.
  • postal_file (Optional[str]) – Metadata file for zip codes, with zip codes, their state, and other geographic information. Required if there is a states argument.
  • auto (Optional[str]) – {‘cycles’, ‘sensors’, ‘geospatial’, None} If one of the data types is specified, the function will detect which columns contain IDs, time stamps and values of interest automatically. If None (default), the order of columns in the delimited file and the config.ini file should match.
  • id_col_heading (Optional[str]) – Indicates the heading in the header for the ID column.
  • cycle_col_heading (Optional[str]) – Indicates the heading in the header for the cycle mode column.
  • cols_to_ignore (Optional[iterable of [str] or [int]]) – Column headings or 0-based column indexes that should be left out of the output.
  • encoding (Optional[str]) – Encoding of the raw data file. Default: ‘UTF-8’.
  • delimiter (Optional[str]) – Character used as the delimiter within rows. Default is None, but commas, tabs, pipes and spaces are automatically detected (in that priority order) if no delimiter is specified.
  • quote (Optional[str]) – Character surrounding data fields. Default is None, but double and single quotes surrounding data fields are automatically detected and removed if they are present in the data rows. If any other character is specified with this keyword argument and it surrounds data in any column, that character will be removed instead.
  • meta (Optional[bool]) – An alternative way to return metadata about columns, besides the detect_columns() function. To use it, meta must be True, and a dict of metadata will be returned instead of a dict of records.
Returns:

Dict.

Return type:

clean_dict (dict)
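
For illustration, a minimal sketch of typical calls (the file paths and state abbreviations below are hypothetical placeholders, not files shipped with caar):

    from caar.cleanthermostat import dict_from_file

    # Hypothetical delimited text file of cycling data.
    cycles = dict_from_file('data/cycles.csv', auto='cycles', cycle='Cool')
    records = cycles['records']      # keys are named ID/time stamp tuples
    cols_meta = cycles['cols_meta']  # column metadata

    # Filtering on states requires both metadata files.
    tx_az_cycles = dict_from_file('data/cycles.csv', auto='cycles', cycle='Cool',
                                  states='TX,AZ',
                                  sensors_file='data/sensors.csv',
                                  postal_file='data/us_postal_codes.csv')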

caar.cleanthermostat.detect_columns(raw_file, cycle=None, states=None, sensors_file=None, postal_file=None, auto=None, encoding='UTF-8', delimiter=None, quote=None, id_col_heading=None, cycle_col_heading=None, cols_to_ignore=None)[source]

Returns a dict describing the columns that would be included in the dict created by dict_from_file() or pickle_from_file() with the corresponding keyword arguments (‘auto’ is required, and must be a value other than None).

Parameters:
  • raw_file (str) – The input file.
  • cycle (Optional[str]) – The type of cycle that will be in the output. For example, values that may be in the data file are ‘Cool’ and/or ‘Heat’. If no value is specified as an argument, all modes will be in the output.
  • states (Optional[str]) – One or more comma-separated, two-letter state abbreviations.
  • sensors_file (Optional[str]) – Path of metadata file for sensors. Required if there is a states argument.
  • postal_file (Optional[str]) – Metadata file for postal codes. Required if there is a states argument.
  • auto (Optional[str]) – {‘cycles’, ‘sensors’, ‘geospatial’, None} If one of the data types is specified, the function will detect which columns contain IDs, time stamps and values of interest automatically. If None (default), the order of columns in the delimited file and the config.ini file should match.
  • id_col_heading (Optional[str]) – Indicates the heading in the header for the ID column.
  • cycle_col_heading (Optional[str]) – Indicates the heading in the header for the cycle column.
  • cols_to_ignore (Optional[iterable of [str] or [int]]) – Column headings or 0-based column indexes that should be left out of the output.
  • encoding (Optional[str]) – Encoding of the raw data file. Default: ‘UTF-8’.
  • delimiter (Optional[str]) – Character used as the delimiter within rows. Default is None, but commas, tabs, pipes and spaces are automatically detected (in that priority order) if no delimiter is specified.
  • quote (Optional[str]) – Character surrounding data fields. Default is None, but double and single quotes surrounding data fields are automatically detected and removed if they are present in the data rows. If any other character is specified with this keyword argument and it surrounds data in any column, that character will be removed instead.
Returns:

Dict in which keys are one of: ‘id_col’, ‘start_time_col’, ‘end_time_col’, ‘cycle_col’ (the latter three are for cycles data only), ‘time_col’, or the headings of other columns found in the file. The values are dicts.

Return type:

column_dict (dict)
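
A brief sketch (the file path is a hypothetical placeholder; ‘auto’ must be given a non-None value):

    from caar.cleanthermostat import detect_columns

    # Preview which columns would become the ID, time stamps and values.
    columns = detect_columns('data/cycles.csv', auto='cycles', cycle='Cool')
    for key, col_meta in columns.items():
        print(key, col_meta)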

caar.cleanthermostat.pickle_from_file(raw_file, picklepath=None, cycle=None, states=None, sensors_file=None, postal_file=None, auto=None, id_col_heading=None, cycle_col_heading=None, cols_to_ignore=None, encoding='UTF-8', delimiter=None, quote=None, meta=False)[source]

Read delimited text file and create binary pickle file containing a dict of records. The keys are named tuples containing numeric IDs (strings) and time stamps.

See the example .csv data files at https://github.com/nickpowersys/caar.

Example sensor cycle file column headings: DeviceId, CycleType, StartTime, EndTime.

Example sensors file column headings: SensorId, TimeStamp, Degrees.

Example geospatial data file column headings: LocationId, TimeStamp, Degrees.

Common delimiters (commas, tabs, pipes and spaces) are detected in that order within the data rows (the header has its own automatic delimiter detection and is handled separately), and the first delimiter detected is used. In all cases, a row is only used if its number of values matches the number of column headings in the first row.

Each input file is expected to have (at least) columns representing IDs, time stamps (or starting and ending time stamps for cycles), and (if not cycles) corresponding observations.

To use the automatic column detection functionality, use the keyword argument ‘auto’ and assign it one of the values: ‘cycles’, ‘sensors’, or ‘geospatial’.

The IDs should contain both letters and digits in some combination (leading zeroes are also allowed in place of letters). A column whose heading contains the string ‘id’, ‘Id’ or ‘ID’ will then be used as the ID index within the combined ID-time stamp index for a given input file. If there is no such heading, the leftmost column with alphanumeric strings (for example, ‘T12’ or ‘0123’) will be taken as the ID.

The output can be filtered on records from a state or set of states by specifying a comma-delimited string containing state abbreviations. Otherwise, all available records will be in the output.

If a state or states are specified, a sensors metadata file and postal code file must be specified in the arguments and have the same location ID columns and ZipCode/PostalCode column headings in the same left-to-right order as in the examples. For the other columns, dummy values may be used if there is no actual data.

Parameters:
  • raw_file (str) – The input file.
  • picklepath (str) – The path of the desired pickle file. If it is not specified, a filename is generated automatically.
  • cycle (Optional[str]) – The type of cycle that will be in the output. For example, values that may be in the data file are ‘Cool’ or ‘Heat’. If left as None, all cycles will be in the output.
  • states (Optional[str]) – One or more comma-separated, two-letter state abbreviations.
  • sensors_file (Optional[str]) – Path of metadata file for sensors. Required if there is a states argument.
  • postal_file (Optional[str]) – Metadata file for postal codes. Required if there is a states argument.
  • auto (Optional[str]) – {‘cycles’, ‘sensors’, ‘geospatial’, None} If one of the data types is specified, the function will detect which columns contain IDs, time stamps and values of interest automatically. If None (default), the order and headings of columns in the delimited text file and the config.ini file should match.
  • id_col_heading (Optional[str]) – Indicates the heading in the header for the ID column.
  • cycle_col_heading (Optional[str]) – Indicates the heading in the header for the cycle column.
  • cols_to_ignore (Optional[iterable of [str] or [int]]) – Column headings or 0-based column indexes that should be left out of the output.
  • encoding (Optional[str]) – Encoding of the raw data file. Default: ‘UTF-8’.
  • delimiter (Optional[str]) – Character used as the delimiter within rows. Default is None, but commas, tabs, pipes and spaces are automatically detected (in that priority order) if no delimiter is specified.
  • quote (Optional[str]) – Character surrounding data fields. Default is None, but double and single quotes surrounding data fields are automatically detected and removed if they are present in the data rows. If any other character is specified with this keyword argument and it surrounds data in any column, that character will be removed instead.
  • meta (Optional[bool]) – An alternative way to store metadata about columns, besides the detect_columns() function. To use it, meta must be True, and a dict of metadata will be created instead of a dict of records.
Returns:

Path of output file.

Return type:

picklepath (str)
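
A minimal sketch (the file paths are hypothetical placeholders):

    from caar.cleanthermostat import pickle_from_file

    # picklepath is optional; a filename is generated if it is omitted.
    sensors_pickle = pickle_from_file('data/sensor_observations.csv',
                                      picklepath='sensors.pickle',
                                      auto='sensors')
    print(sensors_pickle)  # path of the pickle file that was written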

caar.history module

caar.history.create_sensors_df(dict_or_pickle_file, sensor_ids=None)[source]

Returns pandas DataFrame containing sensor ID, timestamps and sensor observations.

Parameters:
  • dict_or_pickle_file (dict or str) – The object must have been created with the dict_from_file() or pickle_from_file() function.
  • sensor_ids (Optional[list or other iterable of ints or strings]) – Sensor IDs. If no argument is specified, all IDs from the first argument will be in the DataFrame.
Returns:

DataFrame has MultiIndex based on the ID(s) and timestamps.

Return type:

sensors_df (pandas DataFrame)
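
For example (the pickle path and sensor IDs are hypothetical placeholders):

    from caar.history import create_sensors_df

    # Either a dict from dict_from_file() or a pickle file path may be passed.
    sensors_df = create_sensors_df('sensors.pickle')                # all IDs
    subset_df = create_sensors_df('sensors.pickle',
                                  sensor_ids=['T100', 'T101'])      # two IDs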

caar.history.create_cycles_df(dict_or_pickle_file, device_ids=None)[source]

Returns pandas DataFrame containing sensor IDs and cycle beginning timestamps as multi-part indexes, and cycle ending times as values.

Parameters:
  • dict_or_pickle_file (dict or str) – Must have been created with the dict_from_file() or pickle_from_file() function.
  • device_ids (Optional[list or other iterable of ints or strings]) – Device IDs. If no argument is specified, all IDs from the first argument will be in the DataFrame.
Returns:

DataFrame has MultiIndex based on the ID(s) and timestamps.

Return type:

cycles_df (pandas DataFrame)
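
For example (the pickle path and device ID are hypothetical placeholders):

    from caar.history import create_cycles_df

    cycles_df = create_cycles_df('cycles.pickle')                   # all IDs
    one_device_df = create_cycles_df('cycles.pickle', device_ids=['D400'])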

caar.history.create_geospatial_df(dict_or_pickle_file, location_ids=None)[source]

Returns pandas DataFrame containing records with location IDs and time stamps as multi-part indexes and outdoor temperatures as values.

Parameters:
  • dict_or_pickle_file (dict or str) – Must have been created with the dict_from_file() or pickle_from_file() function.
  • location_ids (Optional[list or other iterable of ints or strings]) – Location IDs. If no argument is specified, all IDs from the first argument will be in the DataFrame.
Returns:

DataFrame has MultiIndex based on the ID(s) and timestamps.

Return type:

geospatial_df (pandas DataFrame)

caar.history.random_record(dict_or_pickle_file, value_only=False)[source]

Returns a randomly chosen key-value pair from a dict or pickle file.

caar.histsummary module

caar.histsummary.days_of_data_by_id(df)[source]

Returns pandas DataFrame with ID as index and the number of calendar days of data as values.

Parameters: df (pandas DataFrame) – DataFrame as created by history module.
Returns: DataFrame with count (‘Days’) for each ID.
Return type: days_data_df (pandas DataFrame)
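
For example (the pickle path is a hypothetical placeholder):

    from caar.history import create_sensors_df
    from caar.histsummary import days_of_data_by_id

    sensors_df = create_sensors_df('sensors.pickle')
    days_df = days_of_data_by_id(sensors_df)   # one row per ID, 'Days' as values
    print(days_df.head())
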
caar.histsummary.consecutive_days_of_observations(sensor_id, devices_file, cycles_df, sensors_df, geospatial_df=None, include_first_and_last_days=False)[source]

Returns a pandas DataFrame with a row for each date range indicating the number of consecutive days of data across all DataFrames given as arguments. The starting and ending day of each date range are also given. Only days in which all data types have one or more observations are included.

Parameters:
  • sensor_id (int or str) – The ID of the device.
  • devices_file (str) – Path of devices file.
  • cycles_df (pandas DataFrame) – DataFrame as created by history module.
  • sensors_df (pandas DataFrame) – DataFrame as created by history module.
  • geospatial_df (Optional[pandas DataFrame]) – DataFrame as created by history module.
Returns:

DataFrame with ‘First Day’, ‘Last Day’, and count (‘Consecutive Days’) for each set of consecutive days, for the specified ID.

Return type:

consecutive_days_df (pandas DataFrame)
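
A sketch under stated assumptions (the pickle paths, devices file and sensor ID are hypothetical placeholders):

    from caar.history import (create_cycles_df, create_geospatial_df,
                              create_sensors_df)
    from caar.histsummary import consecutive_days_of_observations

    cycles_df = create_cycles_df('cycles.pickle')
    sensors_df = create_sensors_df('sensors.pickle')
    geospatial_df = create_geospatial_df('geospatial.pickle')

    ranges_df = consecutive_days_of_observations(
        'T100', 'devices.csv', cycles_df, sensors_df,
        geospatial_df=geospatial_df, include_first_and_last_days=True)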

caar.histsummary.daily_cycle_sensor_and_geospatial_obs_counts(sensor_id, devices_file, cycles_df, sensors_df, geospatial_df=None)[source]

Returns a pandas DataFrame with the count of observations of each type of data given in the arguments (cycles, sensor observations, geospatial observations), by day. Only days in which all data types have one or more observations are included.

Parameters:
  • sensor_id (int or str) – The ID of the device.
  • devices_file (str) – Path of devices file.
  • cycles_df (pandas DataFrame) – DataFrame as created by history module.
  • sensors_df (pandas DataFrame) – DataFrame as created by history module.
  • geospatial_df (Optional[pandas DataFrame]) – DataFrame as created by history module.
Returns:

DataFrame with index of the date, and values of ‘Cycles_obs’, ‘Sensors_obs’, and ‘Geospatial_obs’.

Return type:

daily_obs_df (pandas DataFrame)

caar.histsummary.daily_data_points_by_id(df, devid=None)[source]

Returns a pandas DataFrame with MultiIndex of ID and day, and the count of non-null raw data points per ID and day as values.

Parameters:
  • df (pandas DataFrame) – DataFrame as created by history module.
  • devid (Optional[int or str]) – The ID of a device.
Returns:

DataFrame indexed by date, and with counts of observations as values.

Return type:

daily_obs_df (pandas DataFrame)
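
For example (the pickle path and device ID are hypothetical placeholders):

    from caar.history import create_sensors_df
    from caar.histsummary import daily_data_points_by_id

    sensors_df = create_sensors_df('sensors.pickle')
    all_ids_daily = daily_data_points_by_id(sensors_df)               # every ID
    one_id_daily = daily_data_points_by_id(sensors_df, devid='T100')  # one ID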

caar.histsummary.df_select_ids(df, id_or_ids)[source]

Returns pandas DataFrame that is restricted to a particular ID or IDs (device ID, or location ID in the case of geospatial data).

Parameters:
  • df (pandas DataFrame) – DataFrame that has been created by a function in the history or histsummary modules (it must have a numeric ID as the first or only index column).
  • id_or_ids (int or str, list of ints or strs, or tuple) – A tuple should have the form (min_ID, max_ID).
Returns:

daily_obs (pandas DataFrame)
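
For example (the pickle path and the numeric IDs are hypothetical placeholders):

    from caar.history import create_sensors_df
    from caar.histsummary import df_select_ids

    sensors_df = create_sensors_df('sensors.pickle')
    one_id = df_select_ids(sensors_df, 100)             # a single ID
    several = df_select_ids(sensors_df, [100, 101])     # a list of IDs
    id_range = df_select_ids(sensors_df, (100, 200))    # (min_ID, max_ID) range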

caar.histsummary.df_select_datetime_range(df, start_time, end_time)[source]

Returns pandas DataFrame within a datetime range (slice). If end_time is specified as None, the range will have no upper datetime limit.

Parameters:
  • df (pandas DataFrame) – DataFrame that has been created by a function in the history or histsummary modules (it must have a numeric ID as the first or only index column).
  • start_time (str or datetime.datetime) – Datetime.
  • end_time (str or datetime.datetime) – Datetime.
Returns:

dt_range_df (pandas DataFrame)
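
For example (the pickle path and dates are hypothetical placeholders):

    import datetime

    from caar.history import create_sensors_df
    from caar.histsummary import df_select_datetime_range

    sensors_df = create_sensors_df('sensors.pickle')
    july_df = df_select_datetime_range(sensors_df,
                                       datetime.datetime(2012, 7, 1),
                                       datetime.datetime(2012, 7, 31))
    open_ended_df = df_select_datetime_range(sensors_df,
                                             datetime.datetime(2012, 7, 1),
                                             None)   # no upper datetime limit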

caar.histsummary.count_of_data_points_for_each_id(df)[source]

Returns dict with IDs as keys and total number (int) of observations of data as values, based on the DataFrame (df) passed as an argument.

Parameters: df (pandas DataFrame) – DataFrame as created by history module.
Returns: Dict of key-value pairs, in which IDs are keys.
Return type: counts_by_id (dict)
caar.histsummary.count_of_data_points_for_select_id(df, id)[source]

Returns number of observations for the specified device or location within a DataFrame.

Parameters:
  • df (pandas DataFrame) – DataFrame as created by history module.
  • id (int or str) – ID of device or location.
Returns:

Number of observations for the given ID in the DataFrame.

Return type:

data_points (int)

caar.histsummary.location_id_of_sensor(sensor_id, devices_file)[source]

Returns location ID for a device, based on device ID.

Parameters:
  • sensor_id (int or str) – Device ID.
  • devices_file (str) – Devices file.
Returns:

Location ID.

Return type:

location_id (int)

caar.timeseries module

caar.timeseries.cycling_and_obs_arrays(cycles_df=None, cycling_id=None, sensors_df=None, sensor_id=None, geospatial_df=None, start=None, end=None, sensors_file=None, freq='1min')[source]

Returns 2-tuple containing two NumPy arrays: the first is a time series at the specified frequency, and the second is an array of vectors at the specified frequency (‘freq’), such that all data corresponds to the time stamps in the first array. The first column contains ON/OFF status of the cycling device. The remaining column or columns contain sensor and/or geospatial data. For cycle data, ON status is given by 1’s (as floats), and OFF status is given by 0’s. For sensor or geospatial data, intervals without actual observations are filled with numpy.nan.

Parameters:
  • cycles_df (pandas DataFrame) – Cycles DataFrame from history module.
  • cycling_id (int or str) – Cycling device ID.
  • sensors_df (Optional[pandas DataFrame]) – Sensors DataFrame from history module.
  • sensor_id (Optional[int or str]) – Sensor ID.
  • start (datetime.datetime) – First day to include in output.
  • end (datetime.datetime) – Last day to include in output.
  • freq (str) – Frequency, expressed in forms such as ‘1min’, ‘30s’, ‘1min30s’, etc.
  • geospatial_df (Optional[pandas DataFrame]) – Geospatial DataFrame from history module. If there is a geospatial DataFrame, a sensor ID and metadata file for sensors’ locations is needed. See the sensors_file parameter description and example file.
  • sensors_file (Optional[str]) – File path. Only needed if the sensor has associated geospatial data. The sensors file should contain a location ID column that corresponds to a column in the geospatial data (a foreign key).
Returns:

The tuple contains a NumPy array of datetimes, and a NumPy array of cycle status (ON/OFF) and sensor and/or geospatial data, with a vector for each datetime. While cycle data points are always 1 or 0, sensor and geospatial data are numpy.nan in intervals for which there are no recorded observations.

Return type:

times, cycles_and_obs (2-tuple of NumPy arrays)
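
A sketch under stated assumptions (the pickle paths, IDs and dates are hypothetical placeholders):

    import datetime

    from caar.history import create_cycles_df, create_sensors_df
    from caar.timeseries import cycling_and_obs_arrays

    cycles_df = create_cycles_df('cycles.pickle')
    sensors_df = create_sensors_df('sensors.pickle')

    times, cycles_and_obs = cycling_and_obs_arrays(
        cycles_df=cycles_df, cycling_id='D400',
        sensors_df=sensors_df, sensor_id='T100',
        start=datetime.datetime(2012, 7, 1),
        end=datetime.datetime(2012, 7, 7), freq='1min')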

caar.timeseries.on_off_status(df, id=None, start=None, end=None, freq='1min')[source]

Returns a tuple of two NumPy arrays: a 1D NumPy array with datetimes, and a NumPy array with corresponding ON/OFF status as 1 or 0 (numpy.int8) for each interval at the frequency specified.

Parameters:
  • df (pandas DataFrame) – The DataFrame should contain cycles data, and should have been created by the history module.
  • id (int or str) – Device ID.
  • start (datetime.datetime) – Starting datetime.
  • end (datetime.datetime) – Ending datetime.
  • freq (str) – Frequency in a pandas-recognized format. Default value is ‘1min’.
Returns:

1D NumPy array with Python datetimes and 1D NumPy array of ON/OFF status as ints (numpy.int8).

Return type:

A 2-tuple (tuple)
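
For example (the pickle path, ID and dates are hypothetical placeholders):

    import datetime

    from caar.history import create_cycles_df
    from caar.timeseries import on_off_status

    cycles_df = create_cycles_df('cycles.pickle')
    times, status = on_off_status(cycles_df, id='D400',
                                  start=datetime.datetime(2012, 7, 1),
                                  end=datetime.datetime(2012, 7, 7),
                                  freq='30s')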

caar.timeseries.sensor_obs_arr_by_freq(df, id=None, start=None, end=None, cols=None, freq='1min', actuals_only=False)[source]

Returns tuple of NumPy arrays containing 1) indexes including timestamps (‘times’) and 2) sensor observations at the specified frequency. If actuals_only is True, only the observed temperatures will be returned in an array. Otherwise, by default, intervals without observations are filled with zeros.

Parameters:
  • df (pandas DataFrame) – DataFrame with temperatures from history module.
  • id (int or str) – Device ID or Location ID.
  • start (datetime.datetime) – First interval to include in output array.
  • end (datetime.datetime) – Last interval to include in output array.
  • cols (Optional[str or list of str]) – Column heading/label, or list of labels, for the column(s) that should be in the output array as data. By default, the first data column on the left is in the output, while others are left out.
  • freq (str) – Frequency of intervals in output, specified in format recognized by pandas.
  • actuals_only (Boolean) – If True, return only actual observations. If False, return array with zeros for intervals without observations.
Returns:

  1) ‘times’ (datetime64[m]) and 2) ‘temps’ (numpy.float16).

Return type:

temps_arr (structured NumPy array with two columns)
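
For example (the pickle path, ID and dates are hypothetical placeholders):

    import datetime

    from caar.history import create_sensors_df
    from caar.timeseries import sensor_obs_arr_by_freq

    sensors_df = create_sensors_df('sensors.pickle')
    temps_arr = sensor_obs_arr_by_freq(sensors_df, id='T100',
                                       start=datetime.datetime(2012, 7, 1),
                                       end=datetime.datetime(2012, 7, 7),
                                       freq='5min', actuals_only=True)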

caar.timeseries.plot_cycles_xy(cycles_and_obs)[source]

Returns 2-tuple for the purpose of x-y plots. The first element of the tuple is an array of datetimes. The second element is an array of cycling states (1’s and 0’s for ON/OFF). The argument must be a return value from the function cycling_and_obs_arrays().

Parameters: cycles_and_obs (tuple of NumPy arrays) – The tuple should be from cycling_and_obs_arrays().
Returns: The first array (which can be plotted on the x-axis) holds timestamps (datetime64). The second holds the corresponding ON/OFF status values.
Return type: times_x, onoff_y (tuple of NumPy arrays)
caar.timeseries.plot_sensor_geo_xy(cycles_and_obs)[source]

Returns x and y time series where x holds timestamps and y is a series of either sensor observations, geospatial data observations, or both, depending on the argument. The single argument must be the return value (a 2-tuple) from the function cycling_and_obs_arrays().

Parameters: cycles_and_obs (tuple of NumPy arrays) – The tuple should be from cycling_and_obs_arrays().
Returns: The first array (which can be plotted on the x-axis) holds datetimes. The second has corresponding data from sensors or from geospatial data sources. Only non-null observations are returned.
Return type: x, y (tuple of NumPy arrays)
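
A plotting sketch that ties the two functions together, using matplotlib for illustration (the pickle paths, IDs and dates are hypothetical placeholders):

    import datetime

    import matplotlib.pyplot as plt

    from caar.history import create_cycles_df, create_sensors_df
    from caar.timeseries import (cycling_and_obs_arrays, plot_cycles_xy,
                                 plot_sensor_geo_xy)

    cycles_df = create_cycles_df('cycles.pickle')
    sensors_df = create_sensors_df('sensors.pickle')
    times_and_obs = cycling_and_obs_arrays(
        cycles_df=cycles_df, cycling_id='D400',
        sensors_df=sensors_df, sensor_id='T100',
        start=datetime.datetime(2012, 7, 1),
        end=datetime.datetime(2012, 7, 7))

    times_x, onoff_y = plot_cycles_xy(times_and_obs)
    x, y = plot_sensor_geo_xy(times_and_obs)

    plt.step(times_x, onoff_y, where='post', label='Cycle ON/OFF')
    plt.plot(x, y, label='Sensor observations')
    plt.legend()
    plt.show()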