Long term data format design considerations

K. Shankari edited this page Sep 23, 2015 · 9 revisions

Requirements:

  1. Our basic model is an infinite length linked list where the nodes are locations and the links are trips/sections. In order to make navigation of this list easier, each node and edge will have front and back pointers.
  2. We can retrieve any subsequence of this list by providing start and end times, or by querying on the node and edge decorators.
  3. The edges should support decorations. Some examples are:
  • Fuel consumption: When we integrate with the Bluetooth OBD port reader, we will get a graph of fuel consumption with respect to time, which can be a decoration for automobile trips
  • Road surface: When we integrate with the accelerometer, we can get a graph of pavement quality over time (à la Biketastic)
  • Temperature: We can integrate with the temperature sensor to get a graph of ambient temperature over time - was that bus/train overheated or overcooled or comfortable?
  • Mood: We can integrate with mood prompts to provide an indication of travel quality over time...
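
A minimal sketch of this model, assuming hypothetical `Place` and `Trip` classes (the names and fields are illustrative only, not an existing API). Nodes and edges both carry front and back pointers, edges carry a dict of decorations, and a subsequence is retrieved by walking forward between two times:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison; the structure is cyclic
class Place:
    """A node in the list: somewhere the user stayed."""
    enter_ts: float
    exit_ts: float
    prev_trip: Optional["Trip"] = None  # back pointer
    next_trip: Optional["Trip"] = None  # front pointer


@dataclass(eq=False)
class Trip:
    """An edge in the list: travel between two places."""
    start_ts: float
    end_ts: float
    prev_place: Optional[Place] = None  # back pointer
    next_place: Optional[Place] = None  # front pointer
    decorations: dict = field(default_factory=dict)  # e.g. {'mood': [...]}


def link(place_a: Place, trip: Trip, place_b: Place) -> None:
    """Connect place_a -> trip -> place_b in both directions."""
    place_a.next_trip = trip
    trip.prev_place = place_a
    trip.next_place = place_b
    place_b.prev_trip = trip


def trips_between(first: Place, start_ts: float, end_ts: float) -> list:
    """Walk forward from `first`, collecting trips that overlap [start_ts, end_ts]."""
    found = []
    trip = first.next_trip
    while trip is not None and trip.start_ts <= end_ts:
        if trip.end_ts >= start_ts:
            found.append(trip)
        trip = trip.next_place.next_trip if trip.next_place else None
    return found
```

Querying on decorators would follow the same walk, with a filter on `trip.decorations` instead of the timestamps.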

Big question:

Should the segmentation into various modes be part of the original structure or should it be a decoration? Here is a summary of the pros and cons, with some examples to flesh this out further.

  1. Argument for structure: The notion of a trip consisting of a set of sections seems pretty natural. Each of the decorations can then apply to a section in addition to a trip.

  • Trip: { 'trip_id': 'trip_id_1', 'sections': ['section_id_0', 'section_id_1', 'section_id_2'], 'start_time': 343534343, 'end_time': 3435343, 'start_point': 556565, 'end_point': 4343434 }

  • Section: { 'section_id': 'section_id_0', 'trip_id': 'trip_id_1', 'start_time': 343534343, 'end_time': 3435343, 'start_point': 556565, 'end_point': 4343434, 'mode': 'walking', 'track_points': [ .... ] }

  • Fuel Consumption: { 'section_id': 'section_id_0', 'fuel_consumption': [ {'ts': 343534343, 'fuel': 34343}, {'ts': 343534343, 'fuel': 34000} ] }

  • Mood: { 'section_id': 'section_id_0', 'mood': [ {'ts': 343534343, 'mood': 'good'}, {'ts': 343535343, 'mood': 'bad'}, {'ts': 343536343, 'mood': 'shitty'} ] }

  2. Argument for decoration: Trip sections are really only useful for multi-modal trips. We want to have a lot more multi-modal trips, and we think that there will be a lot more multi-modal trips, but are they so fundamental that they should go into the main structure?

  • Trip: { 'trip_id': 'trip_id_1', 'start_time': 343534343, 'end_time': 3435343, 'start_point': 556565, 'end_point': 4343434, 'track_points': [ ] }

  • Section: { 'section_id': 'section_id_0', 'trip_id': 'trip_id_1', 'start_time': 343534343, 'end_time': 3435343, 'start_point': 556565, 'end_point': 4343434, 'mode': 'walking' }

  • Fuel Consumption: { 'trip_id': 'trip_id_1', 'start_ts': 3435343, 'fuel_consumption': [ {'ts': 343534343, 'fuel': 34343}, {'ts': 343534343, 'fuel': 34000} ] }

  • Mood: { 'section_id': 'section_id_0', 'start_ts': 3435343, 'mood': [ {'ts': 343534343, 'mood': 'good'}, {'ts': 343535343, 'mood': 'bad'}, {'ts': 343536343, 'mood': 'shitty'} ] }

  3. More radical option: There is no structure on the locations - they are basically a time series of (potentially cleaned) points. The trips and sections are decorations on the location time series. In this case, the trips and sections can be combined to form a single, hierarchical decoration. The advantage of this option is that we can store the locations in a separate database (a timeseries database such as readingDB or quasar?) from the decorations database (SQL? NoSQL?)

  • Location: { 'timestamp': 223232423, 'coordinates': [37.2341, -122.4581, 5.4587], 'accuracy': 0.5678, 'heading': 95.81, 'speed': 21.15 }. Note that we may need to split these into multiple streams if the database does not support storing multiple fields per timestamp.

  • Trip: { '_id': 'trip_id_1', 'start_ts': 343534343, 'end_ts': 3435343, 'sections': [ { 'section_id': 'section_id_0', 'trip_id': 'trip_id_1', 'start_time': 343534343, 'end_time': 3435343, 'start_point': 556565, 'end_point': 4343434, 'mode': 'walking' } ] }. Open question: can we store a reference to a location in a 'start' field here? If not, we can use the start_ts and end_ts to retrieve the points from the location database.

  • Fuel Consumption: [ {'timestamp': 343534343, 'fuel': 34343}, {'timestamp': 343534343, 'fuel': 34000} ]

  • Mood: [ {'timestamp': 343534343, 'mood': 3}, {'timestamp': 343535343, 'mood': 2}, {'timestamp': 343536343, 'mood': 1} ], with the moods encoded as numbers using the mapping { 'good': 3, 'bad': 2, 'shitty': 1 }
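
To illustrate the third option, here is a rough sketch of how the points for a trip could be pulled out of the location time series using only the trip's `start_ts` and `end_ts`. A sorted Python list stands in for the timeseries database, and `points_in_range` is a hypothetical helper, not an API of readingDB or quasar:

```python
from bisect import bisect_left, bisect_right


def points_in_range(locations, start_ts, end_ts):
    """Return the locations whose timestamp falls in [start_ts, end_ts].

    `locations` must be sorted by 'timestamp', as a time series would be.
    """
    timestamps = [loc["timestamp"] for loc in locations]
    lo = bisect_left(timestamps, start_ts)
    hi = bisect_right(timestamps, end_ts)
    return locations[lo:hi]


# Illustrative data: one point every 10 seconds.
locations = [
    {"timestamp": ts, "coordinates": [37.2341, -122.4581, 5.4587]}
    for ts in range(100, 200, 10)
]
trip = {"_id": "trip_id_1", "start_ts": 120, "end_ts": 150}
trip_points = points_in_range(locations, trip["start_ts"], trip["end_ts"])
```

The same lookup works for a section, or for any decoration that carries its own start and end timestamps.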

Note that the difference between the three is basically the location where the points are stored (Trip versus Section versus separate DB). The implications behind this are:

  1. The decorations might apply to only part of the trip. For example, the fuel consumption would only apply to car trips, while the pavement quality would not apply to walking trips. We can work around this in one of two ways:
  • We can specify the start and end of the decoration (as in the fuel consumption example above)
  • We can allow decorations to be associated with section_ids or trip_ids
  2. In the first two alternatives, in order to work with trips, we have to access only a single object, but in order to access a section, we need to access both the section and the underlying trip. We plan to provide libraries/queries that can retrieve subsections of the trip points that will make this easier. In the third alternative, we will always have to make calls to multiple databases.
  3. In order to use the decorations in the UI, we will need to retrieve a bunch of decorations in addition to the trips and stitch them together on the client. This is not necessarily a bad thing, but we need to think about it.
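
As a sketch of what the client-side stitching might look like, assuming each decoration is a list of timestamped samples (the `stitch` function and the field names below are hypothetical):

```python
def stitch(trip, decorations):
    """Return a copy of `trip` with each decoration clipped to the trip's time range.

    `decorations` maps a decoration name to a list of samples, each of which
    has a 'timestamp' key. The input trip is left unmodified.
    """
    stitched = dict(trip)
    stitched["decorations"] = {
        name: [
            sample for sample in series
            if trip["start_ts"] <= sample["timestamp"] <= trip["end_ts"]
        ]
        for name, series in decorations.items()
    }
    return stitched


trip = {"_id": "trip_id_1", "start_ts": 100, "end_ts": 200}
decorations = {
    "fuel": [{"timestamp": 90, "fuel": 5}, {"timestamp": 150, "fuel": 4}],
    "mood": [{"timestamp": 120, "mood": 3}, {"timestamp": 250, "mood": 1}],
}
stitched = stitch(trip, decorations)
```

Because the clipping uses only timestamps, the same code would also work for a decoration that covers just part of a section.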

Time considerations

As always, time is a giant hassle to deal with because of stupid timezones. In general, in order to separate the modelling and presentation layers, we should separate the timestamp (model) from the presentation (timezone based formatting). However, in this case, we need to consider how to represent times that occurred in the past when the user was in a different timezone. In particular, consider the following use case:

  1. The user lives on the west coast, so her default timezone is Pacific.
  2. She goes on a trip to the east coast. Let us assume that she visits the Statue of Liberty from 10am to 1pm, Eastern time.
  3. She returns to the west coast and then reviews her diary for the day. If we format the times according to her current timezone, the trip will show as having occurred from 7am to 10am.

She is unlikely to appreciate this because it does not match her experience. So we really need to know the timezone for each trip.

This gets even more complex when you consider trips that span timezones. Consider the flight that the same user took to return from the east coast to the west coast. She might have departed at 3pm Pacific and arrived at 11pm Pacific, but in order to match her mental model, we need to display a departure of 6pm (Eastern) and an arrival of 11pm (Pacific). So we need to support multiple timezones per trip.
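
The Statue of Liberty example can be reproduced with a few lines of Python. This sketch assumes Python 3.9+ (for the standard `zoneinfo` module) and an arbitrarily chosen date; `format_ts` is a hypothetical helper:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")
pacific = ZoneInfo("America/Los_Angeles")

# The model stores a plain timestamp: 10am Eastern on an arbitrary date.
visit_start = datetime(2015, 9, 20, 10, 0, tzinfo=eastern).timestamp()


def format_ts(ts, tz):
    """Presentation layer: render a timestamp in the given timezone."""
    return datetime.fromtimestamp(ts, tz).strftime("%H:%M")


print(format_ts(visit_start, eastern))  # 10:00, matches her experience
print(format_ts(visit_start, pacific))  # 07:00, what her current timezone gives
```

The timestamp is identical in both calls; only the timezone used for formatting changes, which is why it has to be recorded with the trip.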

Decision

  1. 'Structure versus decoration': The feedback that I have so far on the data structure is in favor of the decoration approach above, and I think that is more principled as well. We will take the even more general approach of treating the set of location points as a time series and treating the trips as decorations (option 3 above).
  2. 'Decorations that only apply to sections': Specify start and end points (approach 1). This is because it allows us to support decorations that only apply to sub-parts of sections as well (e.g. maybe only when we are going uphill)
  3. 'Timezones': We need to support multiple timezones per trip. We will handle this by making the timezones a decoration for the trip just like everything else. The decoration will indicate the timezone changes, just like the mode changes. We could determine the timezones on the server, just like we determine the modes, and send them over to the client for appropriate formatting.
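
One possible shape for the timezone decoration is a sorted list of timezone changes, with each timestamp formatted in the zone that was in effect at that time. This is only a sketch: the `tz_changes` layout and the helper names are assumptions, and for simplicity the zone change is recorded at arrival.

```python
from bisect import bisect_right
from datetime import datetime
from zoneinfo import ZoneInfo


def zone_at(tz_changes, ts):
    """Return the timezone in effect at `ts`.

    `tz_changes` is a list of {'ts': ..., 'tz': ...} entries sorted by 'ts'.
    """
    starts = [change["ts"] for change in tz_changes]
    idx = bisect_right(starts, ts) - 1
    if idx < 0:
        raise ValueError("timestamp is before the first timezone entry")
    return ZoneInfo(tz_changes[idx]["tz"])


def format_local(tz_changes, ts):
    """Format `ts` in the timezone that applied when it was recorded."""
    return datetime.fromtimestamp(ts, zone_at(tz_changes, ts)).strftime("%H:%M")


# The return flight: departs at 3pm Pacific (6pm Eastern), arrives at 11pm Pacific.
depart = datetime(2015, 9, 21, 15, 0, tzinfo=ZoneInfo("America/Los_Angeles")).timestamp()
arrive = datetime(2015, 9, 21, 23, 0, tzinfo=ZoneInfo("America/Los_Angeles")).timestamp()
tz_changes = [
    {"ts": depart, "tz": "America/New_York"},
    {"ts": arrive, "tz": "America/Los_Angeles"},
]
```

With this data, `format_local(tz_changes, depart)` renders the departure in Eastern time and `format_local(tz_changes, arrive)` renders the arrival in Pacific time.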

Final design

Here's how we plan to represent the current decorators that we have.

Location database

Time series database that contains a sequence of locations. Each location will have the following fields. Note that we may need to store these as multiple separate series if we can't find a timeseries database that supports complex objects.

  1. 'latitude': the latitude of the location (in degrees)
  2. 'longitude': the longitude of the location (in degrees)
  3. 'altitude': the altitude of the location (in meters)
  4. 'accuracy': the horizontal accuracy of the location point (in meters)
  5. 'speed': the speed of the point (unsure if this is different from the speed that you compute just by looking at the distance and time from the previous location point)
  6. 'heading': also called course (degrees, starting at due north and moving clockwise, north = 0 degrees, east = 90 degrees, south = 180 degrees)
  7. 'vaccuracy': vertical accuracy (in meters, iOS only). For now, we keep them separate, but it would be better to unify them
  8. 'floor': floor of the building where the user is located (iOS only)
  9. 'ts': timestamp in seconds since 1970. While choosing between seconds and milliseconds, picked seconds since both iOS and Python use it. We can just convert the Android version.
  10. 'formatted_ts': 'ts' formatted in the appropriate timezone. This is useful for us while we are investigating the data.
  11. 'extras': dict of everything that is not standard (hasAccuracy, hasBearing, initialBearing, lat1, lat2...). We do not plan to use any of this right now, but it is useful in case we need the data later. Storage is cheap.
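
A rough sketch of building one such entry from an Android-style reading whose timestamp is in milliseconds. The `make_location_entry` function and the shape of the `raw` input are assumptions made for illustration; only the output field names come from the list above.

```python
from datetime import datetime, timezone

STANDARD_FIELDS = (
    "latitude", "longitude", "altitude", "accuracy",
    "speed", "heading", "vaccuracy", "floor",
)


def make_location_entry(raw, ts_ms, tz=timezone.utc):
    """Build a location entry from a raw reading and a millisecond timestamp.

    Standard fields missing from `raw` (e.g. the iOS-only ones) are stored as None.
    Anything non-standard is kept under 'extras'.
    """
    ts = ts_ms / 1000.0  # convert Android milliseconds to seconds
    entry = {name: raw.get(name) for name in STANDARD_FIELDS}
    entry["ts"] = ts
    entry["formatted_ts"] = datetime.fromtimestamp(ts, tz).isoformat()
    entry["extras"] = {k: v for k, v in raw.items() if k not in STANDARD_FIELDS}
    return entry


raw = {
    "latitude": 37.2341, "longitude": -122.4581, "altitude": 5.4587,
    "accuracy": 0.5678, "heading": 95.81, "speed": 21.15,
    "hasAccuracy": True,
}
entry = make_location_entry(raw, 1443000000000)
```

Here `formatted_ts` is rendered in UTC by default; passing the user's timezone as `tz` would give the local time instead.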

Trip database

Contains an ordered list of trips. Each trip will have the following fields:

  1. '_id': unique ID for the trip
  2. 'start_time': UTC timestamp for the start time
  3. 'end_time': UTC timestamp for the end time
  4. 'start_place': the id of the place entry just before this trip
  5. 'end_place': the id of the place entry just after this trip

Place database

Contains an ordered set of places. Each place will have the following fields:

  1. '_id': unique ID for the place
  2. 'enter_time': UTC timestamp for when the place was entered
  3. 'ending_trip': foreign key of the trip that ended when we got to this place
  4. 'exit_time': UTC timestamp for when the place was exited
  5. 'starting_trip': foreign key of the trip that starts when we exit this place
  6. 'coordinates': this could just refer to an entry in the location table, but I don't know if it is easy to refer to entries in a timeseries DB. I guess the reference would be the timestamp of the "canonical" location. Alternatively, we can duplicate at least the (lat, lng, alt) tuple here for easy access.
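
Since trips and places point at each other through foreign keys, it may be worth checking that the two sides agree. A small sketch, with in-memory dicts keyed by '_id' standing in for the two databases (`check_links` is a hypothetical helper):

```python
def check_links(trips, places):
    """Return a list of problems where trip and place pointers disagree."""
    problems = []
    for trip_id, trip in trips.items():
        start = places.get(trip["start_place"])
        if start is None or start.get("starting_trip") != trip_id:
            problems.append(f"{trip_id}: start_place does not point back")
        end = places.get(trip["end_place"])
        if end is None or end.get("ending_trip") != trip_id:
            problems.append(f"{trip_id}: end_place does not point back")
    return problems


trips = {
    "t1": {"_id": "t1", "start_time": 100, "end_time": 200,
           "start_place": "p1", "end_place": "p2"},
}
places = {
    "p1": {"_id": "p1", "enter_time": 0, "ending_trip": None,
           "exit_time": 100, "starting_trip": "t1"},
    "p2": {"_id": "p2", "enter_time": 200, "ending_trip": "t1",
           "exit_time": None, "starting_trip": None},
}
```

An empty result from `check_links(trips, places)` means every trip's start and end place point back at it.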

Section database

  1. '_id': unique ID for the section
  2. 'trip_id': foreign key from the trip table
  3. 'prev_section': foreign key for the previous section
  4. 'next_section': foreign key for the next section
  5. 'start_time': UTC timestamp for the start time
  6. 'end_time': UTC timestamp for the end time
  7. 'start_location': unsure what I should store here. Should we have places for transfer points? Maybe a separate transfer point database?
  8. 'end_location': ditto
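
The prev_section and next_section keys let us walk the sections of a trip in order. A minimal sketch, assuming that the first section of a trip has prev_section set to None and the last has next_section set to None (that convention is not specified above):

```python
def ordered_sections(sections, trip_id):
    """Return the sections of `trip_id` in travel order by following next_section."""
    by_id = {s["_id"]: s for s in sections if s["trip_id"] == trip_id}
    current = next(
        (s for s in by_id.values() if s["prev_section"] is None), None
    )
    ordered = []
    # The length check guards against malformed data that contains a cycle.
    while current is not None and len(ordered) < len(by_id):
        ordered.append(current)
        current = by_id.get(current["next_section"])
    return ordered


sections = [
    {"_id": "s2", "trip_id": "t1", "prev_section": "s1", "next_section": None},
    {"_id": "s0", "trip_id": "t1", "prev_section": None, "next_section": "s1"},
    {"_id": "s1", "trip_id": "t1", "prev_section": "s0", "next_section": "s2"},
    {"_id": "x0", "trip_id": "t2", "prev_section": None, "next_section": None},
]
```

Calling `ordered_sections(sections, "t1")` returns the three sections of that trip regardless of the order in which they were stored.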

Untracked database

Times when the tracking was turned off (either manually or by running out of power)