00:00 Hey friends! Welcome to this video where I want to dive a little bit deeper into notebooks to
00:11 write PySpark or Pandas code. Now I know that not everyone is a data engineer and maybe you don't
00:16 have the skills to write the code yourself. This is where Fabric comes to the rescue, because it also
00:21 allows us to do any kind of data transformation in a drag-and-drop way. And how that works,
00:26 we can explore right now. For that, you can see that I went back to my Fabric trial workspace.
00:31 So you just go to workspaces and then choose the specific workspace you want to use. In this case,
00:35 I'm going to use this one. And now we create a new notebook. We already covered this. So what
00:41 we can do, we can either go to New item up there, choose the notebook, and then attach, for instance,
00:46 a lakehouse or a warehouse ourselves if we want to work against one.
00:50 Or if we want to skip this step, we can also go directly into the lakehouse we want to attach,
00:56 for instance this one. And then we can go to Open notebook and just choose, in this case, a new
01:01 notebook. So click on New notebook for now. And then we just open this notebook. In this case,
01:07 we call it "data wrangler", because that's the right name for what we're about to explore.
01:14 And then let's explore what the data wrangler actually can do for us. So now we have the new
01:19 notebook. And we can also see that, because we created it from the lakehouse, the lakehouse is directly
01:24 attached. The same is true, by the way, for the warehouse, as I already said before. But now we want to use
01:29 lakehouse data for that. So click on the lakehouse, and we can also see that there is the Tables
01:33 section and there are also the Files available. And now, depending on what you want to do, we can use
01:37 either PySpark or we can use pandas, right? And to load the data, we also know that we can write
01:42 the code ourselves, or we could also go to the three dots and have the load code generated for us.
01:47 So for instance, if I click on this orders table here, I can take this option: I can say Load data and then choose
01:53 Spark in here. And then I get this spark.sql with a select star from fabric.lakehouse.dbo.orders
01:59 limit 1000. And one more time, this works because we attached the lakehouse to this notebook.
02:04 So you can read it directly via, in this case, lakehouse name.schema.tablename.
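For reference, that generated "Load data > Spark" cell looks roughly like this; the lakehouse, schema, and table names below just mirror this example, so adjust them to whatever you attached:

```python
# Roughly what the generated load cell contains - the table path
# (lakehouse.schema.table) and the limit reflect this example setup.
df = spark.sql("SELECT * FROM Fabric_Lakehouse.dbo.orders LIMIT 1000")
display(df)
```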
02:10 Okay, so this is our example. We can actually run this: click on Run cell and then we just
02:15 wait until the notebook session starts. You can see here it's connecting, and then we should see the result.
02:20 Okay, and it's finished. We can see it took around two and a half minutes for me. And now I can
02:27 see my data frame in here. So this is basically the data frame loaded. So far so good. Now next, let's
02:33 actually try out the data wrangler. You can already clearly see the data wrangler up there.
02:38 This is one option which you can choose. It then starts searching for data frames inside our notebook.
02:43 In this case, it has already detected one data frame, which is this df data frame,
02:48 this one here. The other option: you can see that the data wrangler is also offered right on this
02:53 specific display output. So if you go to data wrangler there, it will also guide you directly into the
02:57 interface of the data wrangler. And this is the interface for everyone who wants to use PySpark
03:03 or Python code for transformations, but is not able to write
03:07 the code or does not want to write the code. Either way, for that we have the data wrangler in
03:13 Microsoft Fabric. Now you can see that you've got a preview of the data. You can scroll through it,
03:16 but this is exactly what we've seen before. And you also get here some, well,
03:21 statistics regarding the columns you have inside. But much more interesting, of course, are the
03:27 transformations. For instance, what we could do is use some of them right now. For instance,
03:33 here we have, let's see, different kinds of columns. That's fine. And here, for instance,
03:37 we have a region column with the values America and Europe, these are the two regions. Let's say instead of
03:42 America, we would like to have Americas, so the plural of that. So we could go to the region column,
03:50 first select the column, then go to Find and replace. And then we say we would like to find
03:55 and replace: inside the region column, the old value is, for instance, America, and the new value is
04:02 Americas. Americas, like that. Then we can also specify whether you want to match the full string,
04:09 or match case, or use regular expressions. I want to match the full string, so I select this option
04:13 here. And then we basically could go. So for region, America, you can see that there is the old value,
04:19 which is highlighted in red, and there is the new one, which is highlighted in green. And we can clearly
04:25 see the change: now there is an S at the end of America. And if you like this change, of course, you can apply it.
04:30 Otherwise, of course, you can discard it. But let's say I'd like to apply it. Before I click this, though,
04:34 please have a look at what is shown down there. For this replace, well, the data frame was
04:40 converted first into pandas, and it shows us here the code which is generated. In this case,
04:45 it's a pandas replace on the Region column, turning America into Americas. So that's what happened here.
04:50 Please also note this little hint here: your Spark data frame has been converted into a pandas
04:56 sample for performance, but all the code generated by Data Wrangler will be converted to PySpark when
05:02 you add it to the notebook. That's important. So even if you see pandas code here, it will be converted
05:07 into PySpark. Your data frame itself will not be converted, so it will stay, in this case,
05:11 a PySpark data frame, and the transformations will then be applied to this specific PySpark data frame.
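For reference, the two flavors of that replace look roughly like this. This is only a sketch assuming a column called Region, not the literal code Data Wrangler generates:

```python
# Pandas-style preview (runs on the sampled data inside Data Wrangler);
# df_sample is a hypothetical name for the sampled frame.
df_sample['Region'] = df_sample['Region'].replace('America', 'Americas')

# PySpark equivalent of the same step (what ends up in the notebook):
from pyspark.sql import functions as F
df = df.withColumn(
    'Region',
    F.when(F.col('Region') == 'America', 'Americas').otherwise(F.col('Region'))
)
```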
05:16 So that's fine. Click on Apply for now. Okay, here we go, and we've applied our first transformation,
05:22 which you can find down there as well: the cleaning steps. This is kind of similar to what
05:26 we already know from Dataflow Gen2. There we have the applied steps window, and here we
05:32 have the cleaning steps window, which also allows us to click on the prior steps, see the
05:36 results, and delete a step. So if we say we don't need this step anymore, just delete it,
05:41 and then it's gone, right? So let's click on the last operation again, the one with the Americas.
05:47 And now let's also maybe change the schema, let's say. Whenever
05:53 you want to drop columns or add additional things, you can go to the schema option here. So go to Schema,
05:58 like that. And then you can see you can change the column type, clone a column, drop a column,
06:02 rename a column, and so on. So let's say I'd like to drop a column. I don't want to drop the
06:07 region, which was currently selected. Let's go in here and maybe drop other columns. Let's say I'd like to
06:13 drop the customer zip code, I don't need this. And let's say I'd also like to drop the order ID,
06:19 the customer ID, the product ID, and the employee ID, just as an example. And let's click on apply.
06:24 But first, please, let's have a look at this. You also see the highlight in red, and down there,
06:29 again, you can see the code: it's simply a drop with, in this case, the column names.
06:33 If we apply this step, we should see that the ID columns are now gone, and we have fewer
06:39 columns in our data frame. So this step has also been applied, and you can see that five columns
06:44 have been dropped. And one more time, you can still see the delete option and so on here as well.
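For reference, once that drop step lands in the notebook as PySpark, it boils down to something like this; the column names below just follow this example dataset, so treat it as a sketch:

```python
# Sketch of the drop-columns step in PySpark; the generated code wraps
# steps like this in a function, as shown later in the video.
df = df.drop('Order ID', 'Customer ID', 'Product ID', 'Employee ID', 'Customer Zip Code')
```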
06:49 Now, of course, you find many additional transformations
06:55 here, so let's have a quick look at what's available. Under find and replace, we have drop duplicate rows and
07:00 drop missing rows, which are often used in data cleaning steps, fill missing rows,
07:05 and find and replace, which we've seen. Under format, we find various options to capitalize letters, to convert
07:10 text to lowercase or uppercase, and so on. We have the option to split text and strip whitespace.
07:15 Under formulas, we have the option to binarize labels and to one-hot encode labels, which is an important
07:20 transformation for machine learning, and we can calculate text length.
07:25 Under numeric, we have some options like rounding, scaling, and so on.
07:29 Schema we've seen. Sort and filter is also available, and group by and aggregate is also
07:35 available. So there are a lot of transformations in here, which might be quite tedious if you had to know all the
07:40 PySpark code, but here we can simply click them together on our data frame.
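For reference, this is the sort of PySpark you would otherwise write by hand for a few of those operations; a rough sketch with hypothetical column names ("City", "Region", "Sales"), not code generated by the wrangler:

```python
from pyspark.sql import functions as F

# Drop duplicate rows and rows with missing values
df = df.dropDuplicates().dropna()

# Lowercase a text column and strip whitespace (hypothetical column "City")
df = df.withColumn('City', F.trim(F.lower(F.col('City'))))

# Group by and aggregate (hypothetical columns "Region" and "Sales")
df_agg = df.groupBy('Region').agg(F.sum('Sales').alias('total_sales'))
```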
07:45 And as soon as we are done, all we need to do is go to Add code to notebook up here. By the way, you can also save the
07:50 transformed data, in this case as a CSV file, and download it if you want. You can also copy the code
07:56 to the clipboard if you want to store the code somewhere. Or, if you want to apply it, simply go to
08:01 Add code to notebook. Click this option, and yes, I want to do this, so click on Add in here. And then
08:09 if you scroll down to the bottom, you should find the generated code here. But this time,
08:15 as the message already said, this is PySpark code. So even though we saw pandas code in the preview,
08:22 this now is PySpark code. It creates a function, and this function is then applied to the data frame itself.
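The generated cell follows a pattern roughly like this; a sketch assuming the replace and drop steps from above, so the real function body and column names will differ in detail:

```python
from pyspark.sql import functions as F

def clean_data(df):
    # Replace 'America' with 'Americas' in the Region column
    df = df.withColumn(
        'Region',
        F.when(F.col('Region') == 'America', 'Americas').otherwise(F.col('Region'))
    )
    # Drop the ID and zip code columns
    df = df.drop('Order ID', 'Customer ID', 'Product ID', 'Employee ID', 'Customer Zip Code')
    return df

df_clean = clean_data(df)
display(df_clean)
```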
08:27 And then, of course, if I run this, so press Shift+Enter or press the Run
08:33 command here, and just wait until we get the output, we should see that after applying these steps
08:38 we have dropped the columns. And we should also see Americas in our data set instead of America.
08:43 So you see here, now it starts with order date; if I scroll up here, the order
08:48 date actually was this column, but we removed those columns right in the data wrangler. And if I scroll down
08:53 further in here, let's go further down there, you can see here Americas, so the region replacement has also been
08:59 applied as a step. So this is how the data wrangler works. It's a great feature,
09:04 because it extends the functionality of the notebooks. It also allows users who might not be familiar
09:09 with PySpark or with Python code, or who don't want to write it, to quickly transform the data. And you
09:15 can also go back into the data wrangler here, if you click this option, and then you can actually do new
09:19 transformations there as well, if you want to do this. So let's go back.
09:27 That was the PySpark case. But I just want to highlight that it's also available for pandas. That
09:33 just means if I go to my lakehouse here and load another file, this time from the Files section,
09:39 if I click on Files and then, for instance, I'd like to use the Pokemon file here, so Pokemon update
09:44 to be specific, I can click the three dots and load the data, but this time not to Spark. Instead I
09:50 load it to pandas. So I click on Pandas, and then maybe let's at least rename this. Let's say
09:56 dfp, because it's pandas in this case, and then display(dfp), like that.
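For a file in the Files section, the generated pandas load cell looks roughly like this; the path and file name below are illustrative and depend on the attached lakehouse:

```python
import pandas as pd

# Read the CSV from the attached lakehouse's Files section (path is illustrative)
dfp = pd.read_csv("/lakehouse/default/Files/pokemon_update.csv")
display(dfp)
```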
10:05 Okay, here we go, let's run this, and then we just wait. And now this time we see another, well, table. But this time,
10:10 of course, it's pandas. This time it is a Python pandas data frame, not a PySpark data frame.
10:16 But the same is true here as well: the data wrangler works no matter whether you have a Python
10:21 data frame, so a pandas data frame, or a PySpark data frame. So if I go to data wrangler in here,
10:26 I can do exactly the same transformations. The data is loaded, as you can see here,
10:30 I've got my Pokemon table in here, and I can do my transformations. So, for instance,
10:35 I see there is Normal, Poison, and so on; find and replace would also work here. But let's actually
10:40 maybe capitalize letters or make everything lowercase instead. So let's say I use Type 1 in here,
10:46 and I want to actually convert this to lowercase. I can then say I would like to convert
10:50 text to lowercase. I take this option. Type 1 is already selected. You can of course choose
10:55 several columns, Type 2 and so on. Of course, here, Fabric is smart enough to offer only the string
11:01 columns, because we cannot convert a numerical type, right? So that's why it only offers these
11:05 options in here. But that's fine. Let's click on apply. We also see the preview one more time.
11:10 We click the apply option, and then this step is applied. And the same works for the other
11:16 transformations. For now, I think that's fine; as an example, I think you get the point.
11:20 And if you're done with that, you can add the code to the notebook. Click on Add code to notebook,
11:25 and if I scroll now to the bottom, it works exactly the same way. We have a function which is created.
11:30 This time, of course, it is, in quotes, normal Python code, not
11:35 PySpark code, and the clean data function here is applied to dfp as well.
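For the pandas data frame, the generated cell follows the same pattern, just in plain pandas. Roughly, assuming only the lowercase step on the Type 1 column (the real generated code may differ slightly):

```python
def clean_data(df):
    # Convert the 'Type 1' column to lowercase
    df['Type 1'] = df['Type 1'].str.lower()
    return df

dfp_clean = clean_data(dfp.copy())
display(dfp_clean)
```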
11:41 And now if we execute this and look at the results, we should see
11:46 the transformation here. And you can see that Type 1 is now lowercase, right?
11:51 So it works exactly the same way. No matter whether you have a PySpark data frame or a
11:56 pandas data frame, you can use the data wrangler in order to do the transformations.
12:00 It does the heavy lifting without you writing a single line of code yourself. And after doing this, of course,
12:06 you can write the result back either into the Files section, if you want to store it as files,
12:10 or also, of course, into the Tables section, if you want to save it as Delta tables. And we have already seen how to do this.
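Writing the transformed Spark data frame back as a Delta table in the attached lakehouse looks roughly like this; the table name is just an example:

```python
# Save the cleaned data frame as a Delta table in the attached lakehouse
df_clean.write.format("delta").mode("overwrite").saveAsTable("orders_cleaned")
```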
12:14 So that's it for the data wrangler. Hopefully you enjoyed it, and hopefully
12:18 you understood how it works and what we did. If you've got questions, let me know. Otherwise,
12:22 give it a try yourself. And hopefully, see you in the next video. Until then. Peace, guys.