Before you start, remember to run these imports:
import pyspark.sql.functions as F
from pyspark.sql.functions import col, expr
%%time
all_halos = spark.read.parquet('/user/csiaafpr/RockstarExtendedParquet/')
CPU times: user 3.03 ms, sys: 2.84 ms, total: 5.87 ms
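After loading, it is useful to check the available columns and their data types; the schema is known from the Parquet metadata, so this is immediate:
# Print the columns and types of the halo catalogue
all_halos.printSchema()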
halos = all_halos.where(col('redshift') == 0)
The list of available redshifts can be found here: UCHUU Snapshot Redshift correspondences
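The available values can also be obtained directly from the data (this scans the redshift column, so it takes some time):
# Distinct redshift values present in the catalogue, in ascending order
all_halos.select('redshift').distinct().orderBy('redshift').show()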
We can also select several redshifts:
halos = all_halos.where((col('redshift') == 1.54) | (col('redshift') == 0.49))
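An equivalent, more compact form uses Column.isin:
# Same selection written with isin instead of chained OR conditions
halos = all_halos.where(col('redshift').isin(1.54, 0.49))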
Or even a range of redshifts:
halos = all_halos.where((col('redshift') < 1.54) & (col('redshift') > 0.49))
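Column.between expresses a similar selection; note that, unlike the strict inequalities above, it is inclusive at both ends:
# Inclusive range: 0.49 <= redshift <= 1.54
halos = all_halos.where(col('redshift').between(0.49, 1.54))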
If you are familiar with SQL you can also express the condition as a SQL predicate:
halos = all_halos.where('redshift > 0.10 and redshift < 0.50')
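Going one step further, the dataframe can be registered as a temporary view and queried with full SQL (the view name halos_view is just an example):
# Register a temporary view and query it with plain SQL
all_halos.createOrReplaceTempView('halos_view')
halos = spark.sql('SELECT * FROM halos_view WHERE redshift > 0.10 AND redshift < 0.50')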
Select HOST halos at z = 0 (redshift = 0) in the Mvir range cmass_min - cmass_max:
cmass_min = 2.00e15
cmass_max = 2.03e15
hosts = all_halos.where((col('redshift') == 0)
                        & (col('pid') == -1)
                        & (col('Mvir') > cmass_min)
                        & (col('Mvir') < cmass_max))
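As a quick sanity check, the selection can be counted and the mass range summarised (this triggers the actual computation):
# Number of selected hosts and the extremes of their virial masses
hosts.agg(F.count('*').alias('n_hosts'),
          F.min('Mvir').alias('Mvir_min'),
          F.max('Mvir').alias('Mvir_max')).show()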
In the most general case, we can restrict the selected halos with additional conditions they must fulfil, keep just part of the columns of the dataframe (including additional computed ones), and take a random sample of the halos:
halos = (all_halos.where((col('redshift') == 1.54)
                         & (col('pid') == -1)
                         & (col('Mvir') > 1.0e14)
                         & (col('Mvir') < 1.3e14)
                         & (col('Xoff')/col('Rvir') < 0.05)
                         & (col('Spin') < 0.03))
                  .select('id', 'x', 'y', 'z', 'vx', 'vy', 'vz', 'Mvir', 'Rvir', expr('Rvir/Rs_Klypin'))
                  .sample(0.08))
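By default the computed column keeps the generated name (Rvir / Rs_Klypin); it can be given a readable name with alias. A sketch (halos_c and c_klypin are example names):
# Same kind of selection, naming the computed concentration column explicitly
halos_c = (all_halos.where((col('redshift') == 1.54) & (col('pid') == -1))
                    .select('id', 'Mvir', 'Rvir',
                            expr('Rvir/Rs_Klypin').alias('c_klypin'))
                    .sample(0.08))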
Count the number of halos we have selected:
%%time
halos.count()
CPU times: user 5 ms, sys: 1.21 ms, total: 6.21 ms
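Spark re-reads and re-filters the source data on every action, so if a selection will be used several times it can pay off to cache it in memory; this is standard Spark behaviour, not specific to this catalogue:
# Keep the selected halos in memory across subsequent actions (show, write, ...)
halos.cache()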
%%time
halos.show(2)
+--------------+------------------+-------+-------+------+------+-------+--------+-------+------------------+
|            id|                 x|      y|      z|    vx|    vy|     vz|    Mvir|   Rvir|(Rvir / Rs_Klypin)|
+--------------+------------------+-------+-------+------+------+-------+--------+-------+------------------+
| 4226292660622|2.8652900000000003|623.168|388.151|-41.45|-352.1|-256.26|1.013E14| 1140.7|  6.73086568361922|
|81020308687922|           563.068|1758.17|1832.21|-13.99| 91.62| -99.75|1.135E14|1184.58|  6.28584467132214|
+--------------+------------------+-------+-------+------+------+-------+--------+-------+------------------+
only showing top 2 rows
CPU times: user 6.52 ms, sys: 1.64 ms, total: 8.16 ms
Wall time: 37.4 s
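For plotting or further analysis with standard Python tools, a small selection can be brought to the driver as a pandas dataframe; do this only when the sample comfortably fits in the driver's memory:
# Collect the sampled selection to the driver as a pandas DataFrame
halos_pd = halos.toPandas()
halos_pd.head()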
Save the selection for later use. Parquet is the recommended format for saving data, and the results will be stored in HDFS:
halos.write.parquet('halos')
The saved dataframe can be read back at any time:
halos = spark.read.parquet('halos')
Other formats, such as CSV and JSON, are also available:
halos.write.csv('halos-csv')
halos.write.json('halos-json')
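Note that write fails if the target directory already exists; the save mode and per-format options (such as a CSV header line) can be set explicitly. A minimal sketch:
# Overwrite any previous output and include a header row in the CSV files
(halos.write
      .mode('overwrite')
      .option('header', True)
      .csv('halos-csv'))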