We can use delayed functions in Dask to compute the fraction of long trips in a dataset. Here's an example:
import daskimport dask.bag as dbfrom dask import delayed@delayeddef is_long_trip(trip): duration = trip['duration'] return duration > 1200@delayeddef count_long_trips(trips): long_trips = sum(trips) return long_tripsfilenames = ['data/2017-06.csv', 'data/2017-07.csv', 'data/2017-08.csv']b = db.read_csv(filenames)long_trip_counts = b.map(lambda trip: is_long_trip(trip)).map(count_long_trips)total_trips = long_trip_counts.sum()fraction_long_trips = total_trips / b.count().compute()print(fraction_long_trips) |
In this example, we define two delayed functions. The first function is_long_trip takes a trip as input and returns True if the trip is longer than 1200 seconds (20 minutes), otherwise it returns False. The second function count_long_trips takes a list of True/False values and returns the count of True values.
We then create a list of filenames filenames and use Dask's read_csv function to create a bag b containing all the trips in the specified CSV files.
We use the map method to apply the is_long_trip function to each trip in the bag. This returns a bag of True/False values indicating whether each trip is long or not. We then apply the count_long_trips function to each bag of True/False values to get the count of long trips for that bag.
We then use the sum method to compute the total count of long trips across all bags. Finally, we divide the total count of long trips by the total count of trips in the dataset to get the fraction of long trips.
Using delayed functions in this way allows us to perform complex operations on large datasets by breaking them down into smaller computations that can be executed in parallel.